ABSTRACT


The Department of Defense (DOD) possesses tremendous amounts of data
stored in many large databases. Given the size of these databases, large scale data
analysis tools are required to find previously unknown and interesting patterns. Data
mining tools which produce output in the form of production rules, i.e., “If x, Then
y”, are preferred because the generated rules are understandable by humans and
readily support decision making processes.

This thesis investigates the problems associated with the statistical testing of
rules generated from data mining systems. Statistical testing of rules generated by data
mining systems is required to ensure that the generated rules are based on valid
statistical relationships and are not the result of random variation in the underlying
data. A strategy for testing rules with a non-parametric test known as the
randomization test is implemented and applied to rules from a prototype data
mining system.
I. INTRODUCTION


A.  BACKGROUND 

Private and Government organizations commonly collect information pertaining to
the operation of the organization. The amount of information collected has increased in
recent years due to the automation of many routine business practices and the decreasing
cost of storage media for holding the information. Banks, retailers, scientific and engineering
research organizations, and government agencies all maintain large databases of information
collected in their daily operations.

Clearly the means available to store information have increased. Organizations may
collect  gigabytes  of  information  daily  and  have  databases  on  an  organizational  scale  in  the 
terabyte  range.  However,  many  experts  perceive  that  there  is  a  gap  between  data  generation 
and  data  understanding.  [1]  This  growing  gap  between  the  large  amount  of  data  being  stored 
and  the  ability  to  analyze  and  make  use  of  this  data  has  created  interest  in  large  scale  data 
analysis  tools.  These  tools  are  designed  to  search  the  organizations’  databases  for  new  and 
interesting  patterns  which  were  previously  unknown  and  will  be  of  future  benefit  to  the 
organization.  Applications  for  this  technology  exist  in  marketing,  scientific  research,  and  for 
development  of  expert  systems.  “Data  Mining”  tools  are  now  becoming  available  to  assist 
in  large  scale  data  analysis. 

Data mining tools typically represent patterns in the database in the form of
production rules, i.e., If x, then y. The rules produced by the data mining tool also must have
a measure of certainty associated with the pattern which corresponds to the number of
incorrect classifications in proportion to the number of correct classifications. Statistical
hypothesis testing is customarily employed to test the validity of these rules to ensure that
the induced rule is based on a statistically valid relationship and is not the result of random
variation in the underlying data.
B. RESEARCH OBJECTIVE


The primary objective of this thesis is to examine the issues and problems associated
with statistical hypothesis testing of rules induced by data mining tools. The problems that
conventional testing methods encounter when applied to such rules are revealed, and an
alternative strategy based on a non-parametric testing method known as randomization
testing is implemented. Randomization testing methods are used to test the statistical
significance of a set of rules produced by the Naval Postgraduate School Genetic Program
(NPSGP), which is a prototype data mining system. The rules are tested to see whether they
are, in fact, better than those which would be expected to develop as a result of
random variations in the underlying data.

C.  ORGANIZATION  OF  THE  STUDY 

Chapter I provides the general introduction to the subject. Chapter II provides
background  on  the  process  of  data  mining  and  principles  behind  the  genetic  programming 
methods  utilized  in  NPSGP.  Chapter  III  provides  information  associated  with  statistical 
testing  of  rules  induced  by  data  mining  tools  and  presents  the  theory  underlying 
randomization  testing.  Chapter  IV  provides  information  on  the  data  and  methodology  used 
in  the  study  and  Chapter  V  provides  results  of  the  statistical  testing.  Chapter  VI  contains 
conclusions  and  recommendations  of  the  thesis. 
II. BACKGROUND


A.  PRINCIPLES  OF  KNOWLEDGE  DISCOVERY 

1.  Deductive  Inference  versus  Inductive  Inference 

Deductive  inference  and  inductive  inference  are  logical  thought  processes  used  by 
humans  to  learn  facts,  theories,  and  principles.  Although  their  goal  is  the  same,  to  learn 
about  the  world  around  us,  they  differ  in  the  manner  in  which  conclusions  are  arrived  at. 

a.  Deductive  Inference 

Deductive  inference  is  the  development  of  some  theory  or  axiom  which  would 
then  be  proven  or  disproved  by  supporting  or  contradicting  facts.  For  example,  a  theory  such 
as  ‘February  is  the  coldest  month  of  the  year’  would  be  proven  or  disproved  according  to  the 
empirical  evidence  available.  Deductive  inference  in  the  most  basic  form  contains  the 
following  steps: 

1.  Development  of  some  preliminary  hypothesis  or  theory 

2.  Collection  of  observational  statements  of  facts 

3.  Judgement  on  quality  of  a  hypothesis  based  on  collected  facts  and  initial  premises 

b.  Inductive  Inference 

Inductive inference is the development of some premise or theory based on
existing knowledge. Given a database of previous years’ monthly temperature readings, a
researcher would then reach his or her own conclusion as to which month is the coldest of the
year. Inductive inference thus attempts to develop a theory or model of the real world based
on available evidence to support the theory.

Inductive  inference  in  its  most  basic  form  contains  the  following  steps: 

1. Collect observational statements of facts

2.  Develop  theory  or  model  based  on  known  facts 

2.  Inductive  Learning 

Learning  is  defined  as  changes  in  the  system  that  are  adaptive  in  the  sense  that 
they  enable  the  system  to  do  some  task  or  tasks  drawn  from  the  same  population  more 
efficiently  and  more  effectively  the  next  time.  [2]  Learning  is  desirable  because  we  seek 
improved  ways  to  coexist  with  our  environment  to  improve  our  welfare.  Inductive  learning 
is  the  application  of  inductive  inferences  to  the  process  of  learning. 

Two  major  types  of  inductive  learning  are  learning  from  examples  (concept 
acquisition)  and  learning  from  observation  (descriptive  generalization).  [3]  Learning  from 
examples  involves  placing  objects  into  logical  conceptual  categories  such  as  weather  = 
sunny,  weather  =  cloudy,  and  determining  on  the  basis  of  observation  the  logical  category 
which  best  fits  the  observation.  Descriptive  generalization  is  the  grouping  of  a  set  of 
observations  into  a  descriptive  group  based  on  some  common  characteristics.  For  example, 
based  on  the  height  measurements  of  a  college  basketball  team,  we  might  reach  the 
conclusion  that  most  basketball  players  are  tall. 
3. Machine Learning and Machine Discovery


a.  Machine  Learning 

Machine  learning  is  a  sub-field  of  artificial  intelligence  dealing  with  the 
computer  supported  derivation  of  domain  models  which  are  based  on  knowledge 
representation  paradigms  of  artificial  intelligence.  [4]  Machine  learning  attempts  to 
automate  the  learning  process  to  discover  new  facts,  concepts,  and  theories  given  a  particular 
domain  or  area  of  interest.  The  machine  learning  technique  is  usually  based  on  a 
representative  human  learning  paradigm  such  as  inductive  inference. 

b.  Machine  Discovery 

Machine discovery is a sub-field of machine learning. It is a specialization
of machine learning in that it is specifically concerned with the discovery of previously
unknown information, from the provided data or information, which is of value to the user. A
common machine discovery learning paradigm is concept acquisition (learning by example).
In this paradigm, given a set of facts F and a predetermined set of classes S, the discovery
method attempts to associate some subset of facts $f \subseteq F$ with a given class $s \in S$. This is also
referred to as learning by supervision, in that user supervision is required to identify the
relevant classes of interest.




4.  Knowledge  Discovery  in  Databases 


a.  Basic  Concepts  of  Knowledge  Discovery  in  Databases 


Knowledge discovery in databases (KDD) is the application of machine
discovery methods to information which is stored in databases. Knowledge discovery is
defined as “the nontrivial extraction of implicit, previously unknown, and potentially useful
information from data.” [1] KDD thus involves finding implicit patterns in the data which
otherwise may go unnoticed. KDD discovery methods make use of the inductive inference
process by associating sets of facts in the database with particular class outcomes; the
discovered association is termed a pattern. Given the general nature of the process, KDD can
be accomplished through a variety of discovery methods. Piatetsky-Shapiro et al. [1] cite
four main characteristics displayed by knowledge discovery in databases:

1.  High  Level  Language.  Discovered  Knowledge  is  represented  in  a  high  level 
language.  It  need  not  be  directly  used  by  humans,  but  its  expression  should  be 
understandable  by  humans. 

2. Accuracy. Discoveries accurately portray the contents of the database. The extent
to which this portrayal is imperfect is expressed by a measure of certainty. The
patterns induced by the discovery methods will very seldom be exact due to values
entered in error, missing information, and normal variation in the data.

3.  Interesting  Results.  Discovered  knowledge  is  interesting  according  to  user- 
defined  biases.  In  particular,  being  interesting  implies  that  the  patterns  are  novel  and 
potentially  useful.  Knowledge  is  useful  when  it  can  help  achieve  a  goal  of  the  system 
or  the  user.  The  biases  or  preferences  of  the  user  may  preclude  some  patterns  with 
high  certainty  from  being  considered  interesting  results. 

4.  Efficiency.  The  discovery  process  is  efficient.  Running  times  for  large  sized 
databases  are  predictable  and  acceptable. 
b. Database Issues


A database is a collection of related files. The most basic database model is
based on the aggregation of records associated with a particular transaction or event. Each
record  is  referred  to  as  a  tuple.  A  data  dictionary  provides  the  definition  of  each  field  within 
a  tuple  and  allowable  values  for  an  individual  field.  The  database  management  system 
provides  a  means  of  storage,  update,  and  retrieval  of  information  for  a  particular  event  or 
group  of  events. 

Database  users  typically  have  some  domain  knowledge  learned  through 
personal  experience  and  consultation  with  the  data  dictionary.  The  user  can  utilize  this 
learned  knowledge  to  focus  the  search  for  patterns  of  interest.  The  use  of  this  domain 
knowledge  to  focus  the  search  for  patterns  of  interest  is  controversial.  [1]  It  is  controversial 
because  the  use  of  the  domain  knowledge  will  focus  the  search  in  a  particular  direction  and 
possibly reduce chances for the discovery of new and useful patterns. For instance,
Structured Query Language (SQL) is an efficient means for data retrieval to support theories
in the deductive inference process. It is not primarily designed to discover previously
unknown information. Unsupervised searches will improve chances of discovery of new
patterns, albeit at a computational penalty.

Databases are typically designed to facilitate record keeping of information
for  an  organization;  application  of  KDD  methods  may  come  as  an  afterthought.  As  such, 
databases  may  contain  irrelevant  attributes  and/or  attributes  entered  in  error.  This  is 
referred  to  as  ‘noise’  in  the  data  and  is  analogous  to  noise  received  in  communications 
signals  in  that  it  may  cause  erroneous  results.  Missing  values  also  may  occur  and  affect 
outcomes  of  the  KDD  process. 
Databases may contain only sparse information with which patterns can
be  inferred.  This  sparse  information  may  not  be  enough  to  provide  a  description  sufficient 
to  induce  a  reliable  model. 

B.  KNOWLEDGE  DISCOVERY  METHODS 

This section provides an overview of Genetic Algorithms and Genetic Programming
that are the basis of the knowledge discovery methods studied in the research.

1.  Genetic  Algorithms 

a.  History 

Genetic  algorithms  are  search  algorithms  which  use  the  mechanics  of  natural 
selection  and  reproduction  as  their  controlling  mechanism.  Genetic  algorithms  were 
developed  by  Holland  (1975).  They  draw  their  logical  basis  from  similarities  to  adaptive 
models in the biological world. Genetic algorithms are based on the Darwinian principle of
survival  of  the  fittest  in  that  only  the  most  fit  individuals  in  a  species  will  be  successful  in 
finding  food,  water,  and  shelter  to  enable  them  to  survive  and  reproduce.  This  analogy  is 
used  in  grading  a  series  of  potential  solutions  to  maximize  or  minimize  an  objective  function 
corresponding  to  a  solution  to  a  problem  of  interest. 

b.  Theory  of  Genetic  Algorithms 

Goldberg  [5]  asserts  that  genetic  algorithms  (GA’s)  differ  from  other  search 
techniques  in  four  fundamental  ways: 

1.  GA’s  work  with  a  coding  of  the  parameter  set,  not  the  parameters  themselves. 
2. GA’s search from a number of points, not a single point.

3.  GA’s  use  payoff  (objective  function)  information,  not  derivatives  or  other 
auxiliary  information. 

4.  GA’s  use  probabilistic  transition  rules,  not  deterministic  rules. 

A candidate solution in a typical GA is represented as a string of binary digits. The objective
function for the radius of a quarter arc of a circle $x^2 + y^2 = r^2$ over the interval $0 < x < 4$ could
be represented by a four bit binary string such as [1011]. The bits of the string represent the
measurement of the radius squared as the value of the string plus a constant one. Thus, the
string [1011] (decimal 11) represents a radius squared ($r^2$) with a measurement of 12. Figure
2-1 is a graph of the problem domain.


Figure  2-1. 

The  most  highly  fit  string  would  represent  the  closest  to  optimization  of  the  function  in  the 
problem  domain.  For  this  problem,  the  binary  string  with  a  value  of  sixteen  [1111]  (string 
value  of  15  plus  a  constant  1)  would  represent  the  optimal  value.  Each  string  is  graded 
according  to  a  fitness  function  which  measures  the  ability  of  the  string  to  solve  the  problem 
at  hand.  A  number  of  different  strings,  which  is  commonly  referred  to  as  the  population,  is 
used to conduct the search in a parallel manner. A fitness rating is developed for each string
in  the  population,  and  the  fitness  of  the  population  as  a  whole  is  computed.  Selection  of  the 
next  generation  of  strings  occurs  according  to  a  probabilistic  function  based  on  the  fitness 
of  the  strings.  Highly  fit  strings  have  a  better  chance  for  selection  and  resulting  survival  than 
strings  with  comparatively  lower  fitness. 

c.  Reproduction,  Crossover,  and  Mutation  in  Genetic  Algorithms 

Genetic  algorithms  typically  use  three  operators  which  are  Reproduction, 
Crossover,  and  Mutation. [5]  Reproduction  is  simply  selection  of  the  most  fit  strings.  A 
common  method  for  implementation  is  the  use  of  random  selection  in  the  form  of  a  lottery 
with  the  more  highly  fit  strings  receiving  more  tickets  or  chances  to  be  selected.  In  the 
previous  example,  assume  a  population  of  four  strings.  Table  2-1  shows  each  string  with  its 
fitness  computed;  the  probability  for  selection  for  reproduction  is  based  upon  a  particular 
string’s  fitness  as  a  percentage  of  the  total  fitness  of  the  population.  The  fitness  of  each 
string is simply interpreted as its decimal value, e.g., [0011] is equal to three, with no strings
exceeding  a  value  of  sixteen  in  the  problem  domain. 


String #   String   Value of Fitness   % of Total
S1         [1011]         11               39%
S2         [0011]          3               11%
S3         [1000]          8               29%
S4         [0110]          6               21%
Total                     28              100%

Table 2-1
A possible result of the reproduction selection process is displayed in Table 2-2, where
string S1 has been selected twice and strings S3 and S4 have been selected once.


String #   String   Value of Fitness
S1         [1011]         11
S1         [1011]         11
S3         [1000]          8
S4         [0110]          6
Total                     36

Table 2-2

Reproduction  alone  improves  the  number  of  highly  fit  strings  in  a  population; 
it  may  not  provide  for  optimization  of  the  desired  function  in  that  it  does  not  provide  any 
improvement  to  existing  strings  in  the  population. 
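
The lottery described for Table 2-1 can be sketched in Python as follows. This is a minimal illustration only and is not taken from any system discussed in this thesis: the function names are hypothetical, fitness is assumed to be the decimal value of each string as in Table 2-1, and the random seed is arbitrary.

```python
import random

def selection_probabilities(population):
    """Return each string's share of the total fitness (fitness = decimal value)."""
    fitness = [int(bits, 2) for bits in population]
    total = sum(fitness)
    return [f / total for f in fitness]

def reproduce(population, rng):
    """Draw a new population of the same size by a fitness-weighted lottery."""
    weights = [int(bits, 2) for bits in population]
    return rng.choices(population, weights=weights, k=len(population))

population = ["1011", "0011", "1000", "0110"]  # S1..S4 from Table 2-1
probs = selection_probabilities(population)
next_generation = reproduce(population, random.Random(0))  # arbitrary seed
```

Because the draw is random, a different seed may yield a different next generation; only the selection probabilities are fixed by the fitness values.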

Crossover  is  the  genetic  mating  of  two  strings  in  an  attempt  to  create  a  more 
highly  fit  string  from  the  genetic  material  of  their  ancestors.  Crossover  occurs  by  the  random 
selection of a crossover position k, where k is a position between 1 and l - 1; l is the length
of the string and 1 is the first position of the string. The two strings then exchange all bits
between positions k + 1 and l inclusively.
For population two, crossover was randomly selected to occur at position k = 1.
Strings S1 and S4 were randomly chosen for mating, with the results displayed in Table 2-3.
Crossover between string S1 [1011] and S4 [0110] at position k = 1 swaps the bits of
S1 and S4 from the second leftmost position through the end of the strings; the new value of S1
is [1110] and the new value of S4 is [0011].


String #   String   Value of Fitness
S1         [1110]         15
S1         [1110]         15
S3         [1000]          8
S4         [0011]          3
Total                     41

Table 2-3

Reproduction  and  crossover  have  resulted  in  an  increase  in  the  fitness  of  the 
population  and  the  result  of  a  more  highly  fit  string  than  previously  encountered  in  the  prior 
population.  Genetic  Algorithms  improve  searches  by  (1)  reproducing  high  quality  notions 
according  to  their  performance  and  (2)  crossing  these  notions  with  many  other  high 
performance  notions  from  other  strings  [5].  A  specified  number  of  generations  of  evolution 
for  the  GA  process  are  selected  by  the  user. 
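
The single-point crossover operation described above can be sketched in Python as follows. This is a minimal illustration with a hypothetical function name; it assumes equal-length bit strings and a crossover position k counted from 1, and it is applied here to the strings [1011] and [0110] at k = 1.

```python
def crossover(parent_a, parent_b, k):
    """Swap the bits in positions k+1..l (1-indexed) between two equal-length strings."""
    if len(parent_a) != len(parent_b):
        raise ValueError("strings must have equal length")
    if not 1 <= k <= len(parent_a) - 1:
        raise ValueError("k must lie between 1 and l - 1")
    # Python slices are 0-indexed, so [:k] keeps positions 1..k.
    child_a = parent_a[:k] + parent_b[k:]
    child_b = parent_b[:k] + parent_a[k:]
    return child_a, child_b

child_s1, child_s4 = crossover("1011", "0110", k=1)
```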

Mutation  is  a  random  operation  involving  the  alteration  of  a  single  bit  or  bits 
in  the  binary  string.  Mutation  provides  a  means  of  change  for  a  string  or  strings  when  all  the 
strings  in  the  population  are  converging  to  a  local  (but  not  global)  maximum. 
Both the string position and population elements selected for mutation are random. The
implementation of the mutation operation may or may not improve the performance of the
string. Empirical genetic algorithm studies suggest that one mutation per thousand bit
(position) transfers may produce good results. Mutation rates are similarly small (or
smaller) in natural populations, leading many researchers to conclude that mutation may be
considered as a secondary mechanism of genetic algorithm adaptation. [5]
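
A bit-flip form of mutation can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: each bit is flipped independently with a fixed probability, the function name is hypothetical, and the rate of 0.001 simply mirrors the one-per-thousand figure quoted above.

```python
import random

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    flipped = {"0": "1", "1": "0"}
    return "".join(flipped[b] if rng.random() < rate else b for b in bits)

rng = random.Random(1)  # arbitrary seed
unchanged = mutate("1011", 0.0, rng)         # rate 0: no bit can flip
inverted = mutate("1011", 1.0, rng)          # rate 1: every bit flips
rarely_changed = mutate("1011", 0.001, rng)  # roughly one flip per thousand bits
```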

2.  Genetic  Programming 

Genetic  Programming  (GP)  is  a  logical  extension  of  Genetic  Algorithms 
developed  by  Koza.  [6]  In  contrast  to  the  fixed  length  strings  used  by  Genetic  Algorithms, 
GP  uses  actual  program  statements,  in  the  form  of  parse  trees,  to  evolve  a  program  which 
would  provide  a  working  solution  to  some  problem  of  interest. 

a.  Genetic  Program  Constructions 

Koza  uses  a  type  of  expression  in  the  implementation  of  GP  known  as  a 
symbolic  expression  or  S-expression.  The  S-expression  is  an  integral  part  of  the  LISP 
programming language. An expression such as “a * (b + c)”, written in LISP prefix form as
(* a (+ b c)), is equivalent to a parse tree for a section of a program. Such a statement
would appear in parse tree format as
depicted  in  Figure  2-2. 

The  unique  capability  of  the  LISP  S-expression  to  serve  as  both  data  and  as 
an  executable  program  statement  was  one  of  the  key  reasons  LISP  was  chosen  for 
implementation  of  GP.  [6]  However,  it  is  also  possible  to  write  genetic  programs  with 
standard  procedural  languages  such  as  C  and  Ada. 
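
The dual role of an expression as data and as an executable statement can be imitated outside LISP. The following Python sketch stores the expression a * (b + c) as nested tuples in prefix order and evaluates it recursively. The tuple representation, the function table, and the variable bindings are assumptions made for illustration and are not part of Koza's implementation or of NPSGP.

```python
import operator

# Hypothetical function set: symbol -> two-argument Python function.
FUNCTIONS = {"+": operator.add, "*": operator.mul}

def evaluate(expr, env):
    """Evaluate a prefix expression stored as nested tuples."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return FUNCTIONS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(expr, str):
        return env[expr]  # terminal: variable lookup
    return expr           # terminal: numeric constant

tree = ("*", "a", ("+", "b", "c"))  # a * (b + c)
result = evaluate(tree, {"a": 2, "b": 3, "c": 4})
```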
The LISP S-expressions used in GP fall into two major categories which are
known  as  the  function  and  terminal  set.  The  function  set  consists  of  domain  specific 
functions  which  the  user  selects  as  appropriate  for  solving  the  problem.  Examples  would  be 
boolean,  arithmetic,  and  other  mathematical  operators  such  as  exponential  functions.  The 
terminal  set  consists  of  arguments  to  the  function  set  during  the  evolution  of  the  genetic 
program.  The  terminal  set  usually  consists  of  relevant  variables  from  the  problem  domain. 

An  important  feature  which  genetic  programs  must  incorporate  is  the  closure 
property.  This  property  prevents  the  evolved  genetic  program  from  taking  on  illegal 
arguments  which  would  result  in  the  execution  of  a  program  statement  ending  in  an  error. 
A  classic  example  of  this  is  a  program  which  attempts  division  by  zero.  The  closure  property 
requires  definition  in  the  function  set  of  a  special  division  operator  to  prevent  this  situation 
from  occurring. 
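
A special division operator of this kind can be sketched in Python as follows. Returning 1 when the denominator is zero is one common convention for a protected division; the value used by any particular system, including NPSGP, is not specified here and is an assumption of this sketch.

```python
def protected_div(numerator, denominator):
    """Division that never raises: returns 1.0 when the denominator is zero."""
    if denominator == 0:
        return 1.0
    return numerator / denominator

ordinary = protected_div(6, 3)
guarded = protected_div(5, 0)
```

Because every combination of arguments yields a number, any evolved expression built from this operator can be executed without a division error.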
b. Operation of a Genetic Program

Genetic  programs  use  the  genetic  algorithm  concepts  of  reproduction, 
crossover,  and  mutation.  The  initial  population  consists  of  a  series  of  randomly  generated 
programs.  As  in  genetic  algorithms,  evolution  of  a  number  of  populations  of  programs 
occurs.  The  crossover  operation  involves  the  swapping  of  S-expressions  between  programs 
in  order  to  evolve  a  more  fit  single  program. 
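
The swapping of sub-expressions between two programs can be sketched in Python as follows. Programs are assumed to be nested tuples in prefix order, and a crossover point is given as a path of child indexes from the root (index 0 holds the operator, so paths use indexes of 1 or more). In a genetic program the points would be chosen at random; here they are fixed for clarity, and all names are hypothetical.

```python
def get_subtree(tree, path):
    """Follow a sequence of child indexes from the root."""
    for index in path:
        tree = tree[index]
    return tree

def replace_subtree(tree, path, new_subtree):
    """Return a copy of `tree` with the node at `path` replaced."""
    if not path:
        return new_subtree
    head, rest = path[0], path[1:]
    children = list(tree)
    children[head] = replace_subtree(tree[head], rest, new_subtree)
    return tuple(children)

def subtree_crossover(parent_a, path_a, parent_b, path_b):
    """Swap the subtrees found at path_a and path_b."""
    sub_a = get_subtree(parent_a, path_a)
    sub_b = get_subtree(parent_b, path_b)
    return (replace_subtree(parent_a, path_a, sub_b),
            replace_subtree(parent_b, path_b, sub_a))

parent_a = ("*", "a", ("+", "b", "c"))  # a * (b + c)
parent_b = ("-", ("*", "x", "y"), "z")  # (x * y) - z
child_a, child_b = subtree_crossover(parent_a, (2,), parent_b, (1,))
```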

c.  Application  of  Genetic  Programming  to  KDD 

Genetic  programming  methods  can  be  applied  to  a  wide  variety  of  machine 
discovery  problems,  including  knowledge  discovery  in  databases.  The  fitness  function  used 
by  the  GP  program  to  implement  KDD  would  be  one  designed  to  favor  evolution  of 
programs  which  can  locate  patterns  in  the  database.  The  data  mining  software  used  in  this 
study  which  is  known  as  the  Naval  Postgraduate  School  Genetic  Program  is  an  example  of 
such  a  system. 

C.  REPRESENTATION  OF  DISCOVERED  KNOWLEDGE 

Discovered  knowledge  in  the  KDD  process  may  be  represented  in  the  form  of  rules. 
Rules  represent  knowledge  in  a  form  humans  can  readily  understand  and  use.  A  rule 
language  serves  two  general  purposes.  The  first  is  to  express  discovered  knowledge  found 
by  the  discovery  method.  The  second  purpose  is  to  serve  as  a  prediction  language.  [7] 
Extensions  of  rules  with  basic  probability  concepts  allow  prediction  of  outcomes  when 
previously  unseen  test  data  is  encountered.  The  predictions  are  based  on  previous  outcomes 
which  served  as  the  basis  for  rule  generation. 
1. Representation by Rules


a.  Production  Rules 

Rules  induced  from  databases  have  been  traditionally  placed  in  the  form  of 
production  rules  which  are  understandable  and  usable  by  humans.  A  production  rule  is 
interpreted  as  a  condition-action  pair  with  the  context  If  condition  A,  Then  action  B.  Given 
a set of multi-valued attributes $\{a_1, a_2, \ldots, a_n\}$ and a series of distinct classes $c_j$, an
elementary description takes the form If $a_1 \wedge a_2 \wedge \cdots \wedge a_n$, then $c_j$, where $a_1, a_2, \ldots$ take
on some attribute value condition and $c_j$ is a class outcome.

Rules can also take on disjunctive conditions such as If
$a_1 \wedge a_2 \vee (a_3 \wedge a_4)$, then $c_1$. Rules are typically expressed with some degree of certainty
expressed  by  a  probability  p(x).  Rules  induced  from  real  world  databases  are  rarely  exact 
due  to  noise  from  erroneous  values  and  normal  variation  in  the  data. 
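
How a conjunctive production rule is checked against a database can be sketched in Python as follows. The records, attribute names, and class labels are invented for illustration, and the proportion computed here (correct classifications among the records the rule covers) is only one plausible reading of the certainty attached to a rule.

```python
def rule_statistics(records, condition, predicted_class):
    """Return (covered, correct, proportion correct among covered records)."""
    covered = [r for r in records if condition(r)]
    correct = [r for r in covered if r["class"] == predicted_class]
    proportion = len(correct) / len(covered) if covered else 0.0
    return len(covered), len(correct), proportion

# Hypothetical database of five tuples.
records = [
    {"outlook": "sunny", "windy": False, "class": "play"},
    {"outlook": "sunny", "windy": True, "class": "stay"},
    {"outlook": "rain", "windy": False, "class": "stay"},
    {"outlook": "sunny", "windy": False, "class": "play"},
    {"outlook": "sunny", "windy": False, "class": "stay"},
]

# If outlook = sunny AND windy = false, then class = play
covered, correct, proportion = rule_statistics(
    records, lambda r: r["outlook"] == "sunny" and not r["windy"], "play")
```

The rule is inexact on this data: it covers three tuples but classifies only two of them correctly.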

From Carbonell [3], the four basic operations whereby production rules may
be acquired and modified are:

1.  Creation.  A  new  rule  is  created  by  the  system  or  acquired  from  an  external  entity. 

2.  Generalization.  Conditions  are  dropped  or  made  less  restrictive,  so  that  the  rule 
applies  in  a  larger  number  of  instances. 

3. Specialization. Additional conditions are added to the condition set, or existing
conditions are made more restrictive, so that the rule applies to a smaller number of
specific conditions.

4.  Composition.  Two  or  more  rules  that  were  applied  together  in  sequence  are  made 
into a single larger rule, thus forming a “compiled” process and eliminating any
redundant  conditions  or  actions. 
b. Decision Trees


Rules  induced  by  the  discovery  method  can  also  be  portrayed  in  the  form  of 
decision trees. A tree-like structure is produced with attributes represented as nodes from
which  further  branching  can  occur.  Individual  classes  are  represented  as  leaves  of  the  tree 
and  may  be  further  described  using  class  probabilities  or  actual  class  counts. 

2.  Rule  Utility 

Given  a  set  of  patterns  (rules)  induced  from  a  database,  some  rules  will  be  more 
useful  than  others  to  the  user.  This  is  the  notion  of  rule  utility.  The  utility,  which  is 
generally  defined  as  usefulness  of  the  rule  to  the  user,  is  a  subjective  concept  and  will  vary 
among  users  for  a  particular  rule.  The  utility  of  the  rule  applies  to  the  application  of  the  rule 
in  the  context  of  the  set  of  possible  actions.  A  rule  with  high  utility  will  enable  an  action  to 
be taken which results in a gain or the avoidance of a loss. In this section, we discuss rule
interestingness,  a  measure  of  rule  utility. 

a.  Rule  Interestingness 

Given  a  set  of  induced  rules,  some  rules  will  be  superior  to  other  rules  in 
terms  of  simplicity,  coverage,  and  accuracy  of  the  rules.  Rule  Interest  (RI)  measures  provide 
a  means  for  grading  of  rules  in  accordance  with  some  predefined  criteria.  The  rule  interest 
measure  is  typically  a  statistics  or  information  theory  concept.  RI  measures  are  useful  for 
comparing  the  output  of  different  rule  induction  algorithms,  and  they  can  be  applied  also  to 
the  ranking  of  rules  previously  identified  as  rules  having  some  utility  to  the  user.  Rules  with 
low  coverage  and  certainty  will  not  be  of  much  use  to  the  user  and  therefore  tend  to  be  of  low 
interest. 
b. General Measures of Rule Interestingness

The  most  basic  rule  interest  measure  used  is  referred  to  as  the  Certainty 
measure.  The  certainty  measure  is  simply  the  number  of  correct  classifications  of  the  rule 
divided  by  the  number  of  examples  in  the  database.  Piatetsky-Shapiro  [8]  suggests  that  all 
rule interest measures should satisfy several basic principles. Measures satisfying these proposed
principles should assign high values to strong rules and lower values to weak rules. His
proposed  measure  of  rule  interest  is  described  as  follows: 

Let $N$ indicate the number of tuples in a database of interest. Let $|A|$ be the
number of tuples satisfying condition A, let $|B|$ be the number of tuples satisfying
condition B, and let $|A \& B|$ be the number of tuples satisfying both A and B. The proposed
principles for a rule interest (RI) measure are:

1. $RI = 0$ if $|A \& B| = |A|\,|B| / N$; if A and B are statistically independent, the rule is not of interest.

2. RI monotonically increases with $|A \& B|$ when other parameters remain the same.

3. RI monotonically decreases with $|A|$ or $|B|$ when other parameters remain the same.

Piatetsky-Shapiro suggests the simplest function which meets these three criteria is

\[ RI = |A \& B| - \frac{|A|\,|B|}{N} \tag{2-1} \]

This is the difference between the number of tuples satisfying both A and B and the number of
such tuples which would be expected if A were independent of B.


Intuitively, RI depends on the coverage of the rule, i.e., what proportion of the database does
the rule classify, and the certainty of the relationship, i.e., what is the correlation of a
particular set of attributes A with the predicted class B.
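As an illustration, the rule interest function of Equation (2-1) can be sketched in a few lines of Python. The function name and the tuple counts below are hypothetical and chosen only to show the behavior of the measure; they are not taken from the data sets used in this study.

```python
def rule_interest(n_a, n_b, n_ab, n_total):
    """Piatetsky-Shapiro rule interest: |A&B| - |A||B|/N."""
    expected_if_independent = (n_a * n_b) / n_total
    return n_ab - expected_if_independent


# Hypothetical counts for a database of 1000 tuples.
# Independent case: |A&B| equals the expected count, so RI is zero.
ri_independent = rule_interest(n_a=200, n_b=500, n_ab=100, n_total=1000)
# Strong rule: far more tuples satisfy both A and B than expected.
ri_strong = rule_interest(n_a=200, n_b=500, n_ab=180, n_total=1000)

print(ri_independent, ri_strong)
```

A positive value indicates that A and B occur together more often than independence would predict; a value near zero indicates a rule of little interest.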


c.  An  Information  Theoretic  Approach  to  Rule  Interestingness 

Goodman  and  Smyth  [9]  suggest  an  information  theoretic  approach  to 
evaluating  rule  interest.  Given  a  rule,  with  some  set  of  attributes  X  and  a  set  of  object  classes 
Y,  in  the  form  if  (X=x),  then  (Y=y)  with  some  probability  p(a),  then  some  information  is 
provided  by  Y=y  about  X.  This  information  is  used  in  part  for  the  evaluation  of  rule  interest. 

Goodman and Smyth propose a rule interest measure known as the J-measure. [9]
The J-measure incorporates the simplicity of the hypothesis and the goodness of
fit of the class to the attributes to come up with a measure of rule interest. It is based on the
concept of average mutual information between discrete random variables as originally defined
by Shannon. The J-measure formula is defined as follows:

$$J(X; Y=y) = p(y)\left[\, p(x|y)\log\frac{p(x|y)}{p(x)} + \bigl(1-p(x|y)\bigr)\log\frac{1-p(x|y)}{1-p(x)} \right] \qquad (2\text{-}2)$$

The J-measure can be broken into two fundamental parts. The first part, p(y), expresses the
coverage of the class y in the database. The remainder of the J-measure is known in
information theory as the cross-entropy of X on the condition that Y=y. Cross-entropy is a
distance measurement between the a posteriori belief about X, given a particular outcome
Y=y, and the a priori belief concerning X. The logarithm in the cross-entropy gives less
weight to rare events (circumstances where p(x) is low) as compared with Piatetsky-Shapiro's
Rule Interest measure. The J-measure thus has a preference for events which occur frequently in
the database and also have a strong correlation between X and Y. The J-measure can be used to
rank, by information content, rules which appear to be redundant with respect to the same
outcomes, with the lower ranking rules being eliminated.
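The two parts of the J-measure can be made concrete with a short Python sketch. The function name is hypothetical, the logarithm is taken to base 2 as an assumption (the base is not stated above), and the prior p(x) is assumed to lie strictly between 0 and 1.

```python
import math


def j_measure(p_y, p_x, p_x_given_y):
    """J-measure: coverage p(y) times the cross-entropy of X given Y=y.

    Assumes 0 < p_x < 1 and uses base-2 logarithms (an assumption).
    """
    def term(posterior, prior):
        # By convention 0 * log(0 / prior) is taken as 0.
        if posterior == 0.0:
            return 0.0
        return posterior * math.log2(posterior / prior)

    cross_entropy = term(p_x_given_y, p_x) + term(1.0 - p_x_given_y, 1.0 - p_x)
    return p_y * cross_entropy


# If knowing Y=y does not change the belief about X, the rule carries no information.
j_uninformative = j_measure(p_y=0.5, p_x=0.3, p_x_given_y=0.3)
# If Y=y makes X certain, the cross-entropy term is large.
j_informative = j_measure(p_y=0.5, p_x=0.5, p_x_given_y=1.0)

print(j_uninformative, j_informative)
```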


Goodman  and  Smyth  explicitly  comment  that  "To  a  large  extent,  ... 
information  based  and  correlation  based  measures,  in  practice,  often  rank  rules  in  a  similar 
order."  [12]  This  is  not  a  surprising  result  since  the  concept  of  mutual  self-information  is 
based  on  the  conditional  probabilities  between  series  of  events  in  a  system  albeit  used  in  an 
informational (log based) manner. [13] Specifically, the mutual information $I(E_j, F_k)$
between the two events $E_j$ and $F_k$ is defined by:

$$I(E_j, F_k) = \log\frac{P(E_j \cap F_k)}{P(E_j)\,P(F_k)} \qquad (2\text{-}3)$$

The mutual information in a system would involve a summation of the mutual information
for all values of $E_j$ and $F_k$.
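A minimal Python sketch of this calculation follows. It assumes the joint probabilities of the events are available as a table, uses base-2 logarithms, and weights each term of the summation by its joint probability, which is the standard definition of average mutual information; the function names are hypothetical.

```python
import math


def event_mutual_information(p_joint, p_e, p_f):
    """Mutual information between two events, as in Equation (2-3)."""
    return math.log2(p_joint / (p_e * p_f))


def average_mutual_information(joint):
    """Probability-weighted sum over all event pairs.

    joint[j][k] is assumed to hold P(E_j and F_k); entries sum to 1.
    """
    p_e = [sum(row) for row in joint]          # marginals P(E_j)
    p_f = [sum(col) for col in zip(*joint)]    # marginals P(F_k)
    total = 0.0
    for j, row in enumerate(joint):
        for k, p in enumerate(row):
            if p > 0.0:                        # skip impossible pairs
                total += p * event_mutual_information(p, p_e[j], p_f[k])
    return total


mi_independent = average_mutual_information([[0.25, 0.25], [0.25, 0.25]])
mi_dependent = average_mutual_information([[0.5, 0.0], [0.0, 0.5]])
print(mi_independent, mi_dependent)
```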


3.  Strategies  for  Refinement  of  Rules 

The  discovery  method  used  to  extract  patterns  from  a  database  may  discover  a  large 
number  of  patterns.  Many  of  these  patterns  may  be  redundant  or  so  specialized  that  they  are 
not  immediately  useful.  The  user  of  the  knowledge  discovery  method  would  like  to  choose 
the  most  general  and  useful  patterns  for  his  own  use.  A  coherent  strategy  must  then  be 
developed  for  selection  of  the  best  patterns  with  regards  to  certainty,  coverage,  and  novelty 
of  the  pattern. 
a. Production Rule Refinement


Major  and  Magano  [11]  suggest  a  rule  refinement  strategy  to  eliminate  rules 
with  low  interestingness.  Interestingness  is  defined  according  to  four  criteria  of  performance, 
simplicity,  novelty,  and  significance.  Rules  which  are  not  interesting  according  to  these 
criteria  are  dropped  from  the  set  of  rules.  The  strategy  is  implemented  as  follows: 


1. Performance. A performance frontier is defined consisting of a Cartesian plane
with axes of <coverage, certainty factor>. If Rule1 is at <G, C>, and Rule2 at <F,
B> with G > F and C > B, then Rule1 is said to dominate Rule2.

2. Simplicity. The concept of a rule lattice is introduced to provide an order among
different  rules.  Simpler  (more  general)  rules  subsume  more  specialized  rules  if  they 
both  cover  the  same  concept.  A  rule  with  no  further  generalization  is  said  to  be  a 
Most  General  Rule  (MGR).  Rules  which  provide  specializations  of  the  MGR  are 
said  to  be  in  the  family  of  the  MGR. 

3. Novelty. In a situation where Rule1 and Rule2 have overlapping concepts, one
of  the  rules  will  most  likely  be  preferred  to  the  other  on  the  basis  of  scientific 
rationale,  uniqueness,  or  performance.  “A  rule  that  adds  little  insight  or  performance 
to  an  existing  set  of  rules  has  no  novelty  with  respect  to  that  set.”  [11] 

4. Significance. For a rule to be considered interesting, it must vary significantly
from other rules in the rule set. A significance measure was devised to take into
account the complexity of the rule and the statistical significance based on an
approximation of the Chi-Square test. The complexity measure J for a more
general concept Q and a rule R for a specialization of Q is:

$$J = \frac{V!}{T!\,(V-T)!} \qquad (2\text{-}4)$$
where I is the number of instances of concept Q, G is the number of instances
covered by concept R, V is the number of variables available to formulate
specializations of concept Q, and T is the number of conjuncts present in R. J is then
the number of specializations of the more general concept Q examined by the induction
mechanism.


The  significance  is  measured  by: 

$$S(R|Q) = -\log_{10}(A \cdot J) \qquad (2\text{-}6)$$


where  A  is  a  numerical  approximation  to  the  one  tailed  significance  level  of  the  Chi- 
Square  test.  An  arbitrary  cut-off  of  2.0  was  established  for  the  domain  researched 
by  the  authors  although  this  could  vary  based  on  the  needs  of  the  user.  The 
significance  measures  favor  rules  with  statistically  significant  differences  and 
simplicity  in  the  structure  of  the  rule. 
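The performance and significance criteria can be illustrated with a small Python sketch. It assumes that J is the count of T-conjunct specializations that can be formed from V variables (a binomial coefficient), and that the significance level A has already been obtained from a Chi-Square approximation. The function names and the numeric values are hypothetical.

```python
import math


def dominates(rule1, rule2):
    """True if rule1 = (coverage, certainty) dominates rule2 on both axes."""
    return rule1[0] > rule2[0] and rule1[1] > rule2[1]


def significance(a, v, t):
    """S(R|Q) = -log10(A * J), with J assumed to be 'V choose T'."""
    j = math.comb(v, t)
    return -math.log10(a * j)


# Rule1 has higher coverage and certainty than Rule2, so it dominates.
rule1_dominates = dominates((0.60, 0.90), (0.40, 0.80))

# Hypothetical rule: significance level 1e-5, 10 candidate variables, 2 conjuncts.
s_value = significance(a=1e-5, v=10, t=2)
passes_cutoff = s_value > 2.0   # cut-off of 2.0 as quoted in the text

print(rule1_dominates, round(s_value, 3), passes_cutoff)
```

Multiplying A by J penalizes rules drawn from a large pool of candidate specializations, so a more complex rule must show a smaller significance level to pass the same cut-off.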

Major and Magano define potentially interesting rules as those that satisfy the
performance criterion or are closely related to rules that do. A rule R is potentially
interesting if:

1.  R  is  on  the  performance  frontier  or 

2.  Q  is  on  the  performance  frontier  and  R  is  in  the  cone  of  Q  or 

3. Q is a most general rule and there are at least three frontier rules in Q’s family, and
R  is  also  in  Q’s  family 

Technically  interesting  (TI)  rules  are  selected  among  potentially  interesting  rules 
according  to  simplicity  and  statistical  significance  criteria.  A  rule  R  is  technically  interesting 
if R is potentially interesting and, for all Q such that Q is TI and R specializes Q,
S(R|Q) > 2.
Rules which are not Genuinely Interesting are then removed. This involves
examination of the TI rules to pick the rules which are most applicable to the concept
of interest. Redundant rules which do not provide an increase in performance are removed
from the rule set. A domain expert is used to pick the rules which are the most useful and
relevant in a scientific sense. The goal is to reduce the induced number of rules to a small
number of rules with high coverage, certainty, and applicability to the concept under
consideration.
III. TESTING OF INDUCED RULES

A.  RELIABILITY  OF  INDUCED  RULES 

Machine Discovery methods such as Genetic Programming can be used by managers
to find patterns in databases in the form of rules. The rules are graded according to rule
interest measures as previously discussed. Before the rules are used for actual decision-making
purposes, testing of the rules must occur to ensure that the induced rules are based
on  valid  statistical  relationships.  The  use  of  untested  rules  would  represent  a  risk  by  acting 
on  information  which,  in  fact,  may  be  the  result  of  random  variation  in  the  data  used  to 
induce  the  rule  being  evaluated.  Hence,  statistical  procedures  are  a  necessity  for  assessing 
the  reliability  of  rules. 

1.  Statistical  Testing 

Mathematical  Statistics  is  the  field  of  science  which  deals  with  the  mathematical 
treatment  of  random  occurrences.  Mathematical  statistics  comprises  probability  theory, 
statistics,  and  their  applications.  A  general  application  of  mathematical  statistics  is  to  assess 
the  accuracy  of  models  which  have  been  inferred  from  an  inductive  process.  Probability 
theory  allows  for  the  assessing  of  the  accuracy  of  a  model  based  on  known  information.  The 
testing  of  models  developed  inductively  by  deductive  reasoning  in  the  form  of  statistical 
testing is necessary to develop accurate models. This forms the basis of the scientific
method: a model developed inductively is tested against what is known about the
probability of the model. The process is depicted in Figure 3-1.
Figure 3-1. After Ref. [12]

2.  Hypothesis  Testing 

Hypothesis  testing  is  used  to  see  if  two  populations  agree  on  some  common 
parameter.  Normally,  the  hypothesis  that  two  populations  agree  is  termed  the  null  hypothesis 
and is denoted by H0. The hypothesis that the populations do not agree is termed the
alternative hypothesis and is denoted by Ha. The test has two possible outcomes: H0 is
accepted, or H0 is rejected. Since hypothesis testing is normally conducted with samples
from  two  different  populations,  there  will  likely  be  differences  in  the  parameter  being  tested 
between  the  two  populations.  Rejection  of  the  null  hypothesis  will  occur  when  results  are 
received  that  are  of  low  probability  under  the  null  hypothesis. 

The  problem  of  what  constitutes  a  low  level  of  probability  is  covered  by  the  notion 
of  significance  level.  Sachs  [12]  addresses  the  concept  of  significance  level. 
If a test with a level of significance of, for example, 5% (significance level a = 0.05)
leads  to  the  deduction  of  a  difference,  the  null  hypothesis  is  rejected  and  the  alternative 
hypothesis  -the  populations  differ-  is  accepted.  The  difference  is  said  to  be  important  or 
statistically  significant  at  the  5%  level,  i.e.,  a  valid  null  hypothesis  is  rejected  in  5%  of  all 
cases  of  differences  as  large  as  those  observed  in  the  given  sample,  and  such  differences  are 
so  rarely  produced  by  random  processes  alone  that:  a.  we  will  not  be  convinced  that  random 
processes  alone  give  rise  to  the  data  or,  formulated  differently,  b.  it  is  assumed  that  the 
difference  in  question  is  not  based  solely  on  a  random  process  but  rather  on  a  difference 
between  populations. 


3.  Problems  with  Testing  of  Rules  Induced  by  Genetic  Programming 

Genetic  Programming  heuristically  searches  for  patterns  corresponding  to  high  levels 
of  the  specified  fitness  function.  The  program  searches  in  parallel  from  a  number  of  starting 
points  and  each  member  of  the  population  evolves  through  crossover  with  other  members 
of  the  population  through  successive  generations.  The  output  of  the  genetic  programming 
process is a number of independent models in the form of rules corresponding to
patterns  in  the  data  base.  Jensen  [13]  notes  that  problems  occur  when  testing  multiple 
models.  Testing  multiple  models  increases  the  probability  of  finding  an  apparently  accurate 
model  by  chance  alone. 

a.  Labelling  Spaces 

Jensen  illustrated  the  problem  with  testing  multiple  models  through  the  use 
of  labelling  spaces.  A  labelling  is  a  set  of  class  observations  with  each  observation 
corresponding to a label. The labelling space L is an N-dimensional space where N is the
number of observations. Any one labelling represents a point in the space of all possible
labellings. The scoring statistic, which is the fitness function in the case of rules induced from
genetic programming, represents the distance d between the actual and predicted labellings.
Thus, a labelling for a rule with a high fitness value would have a smaller distance d than a
labelling for a rule with a lower fitness value. Significance testing can be viewed as
determining the probability p that an arbitrarily chosen labelling falls within distance d of the
predicted  labelling.  All  such  labellings  fall  within  the  shaded  area  depicted  in  Figure  3-2.  The 
probability  p  is  equal  to  the  ratio  of  the  number  of  labellings  within  the  shaded  area  to  the 
number  of  labellings  within  the  entire  labelling  space  L. 
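The ratio described above can be computed directly for a simplified case. The following Python sketch assumes binary class labels, measures distance as the number of observations on which two labellings disagree (Hamming distance), and treats every labelling as equally likely. These are illustrative assumptions and are not part of Jensen's presentation.

```python
from math import comb


def prob_within_distance(n_observations, d):
    """Fraction of all binary labellings within Hamming distance d of a fixed labelling."""
    # Labellings that differ from the predicted labelling in exactly k places: C(N, k).
    within = sum(comb(n_observations, k) for k in range(d + 1))
    total = 2 ** n_observations
    return within / total


# With 10 observations, only a small share of labellings lie close to any given labelling.
p_exact = prob_within_distance(10, 0)   # only the labelling itself
p_near = prob_within_distance(10, 1)    # at most one disagreement
print(p_exact, p_near)
```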


b.  Testing  of  Multiple  Models 

The  testing  of  multiple  models  is  problematic  in  that  it  increases  the  chances 
of  finding  an  apparently  accurate  model  by  chance  alone.  This  concept  can  also  be  illustrated 
by labelling spaces as depicted in Figure 3-3. The points P1 ... P5 represent individual models
in  the  labelling  space. 
Figure 3-3. Individual models (P1, P2, P3, P4, ...) and the actual labelling A plotted as points in the labelling space. After Ref. [13]

When multiple models are tested, significance testing can be viewed as determining the
probability that an arbitrarily chosen labelling falls within distance d of any of the predicted
labellings. The distance d is the actual distance between the actual labelling A and the best
of the predicted labellings (P1). [13] The distance d is then used as the radius for the shaded
circles in Figure 3-4.
More area is clearly in shade in Figure 3-4. This increased area represents the probability
that  a  particular  model  will  be  found  to  be  statistically  significant.  Jensen  comments: 

Testing  multiple  models  puts  more  labellings  within  reach  of  at  least  one  of  the 
models.  This  increases  the  probability  that  an  apparently  accurate  test  will  be  found  by 
chance  alone.  [13] 

Conventional  parametric  statistical  tests  are  designed  to  test  one  model  at  a  time  and  assume 
that  it  is  the  only  model  tested.  For  this  reason,  use  of  a  conventional  statistical  test  could 
lead to misleading results when testing rules obtained from genetic programming.
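The size of this effect can be sketched numerically. The Python fragment below assumes that the models are statistically independent and that each is tested separately at the same significance level; under those assumptions, the chance that at least one model appears significant by chance alone grows quickly with the number of models. The function name is hypothetical.

```python
def prob_spurious_finding(alpha, n_models):
    """Probability that at least one of n independent tests rejects a true null hypothesis."""
    return 1.0 - (1.0 - alpha) ** n_models


for n in (1, 5, 20, 100):
    print(n, round(prob_spurious_finding(0.05, n), 3))
```

With five independent models tested at the 5% level, the chance of at least one spurious finding is already above 20%.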

c.  Testing  of  Correlated  Models 

Jensen also notes that testing of results from induction systems introduces
an additional difficulty due to the testing of multiple models with highly correlated
predictions. High correlations between models may result from the models being closely
related in form or from the models being induced from the same data set. Correlation
between models decreases the probability of finding an apparently accurate model by
chance alone. [14] [15] Jensen also explains this effect in terms of labelling spaces. Highly
correlated predictions are closely grouped in the vicinity of the actual value A. The most
accurate model has the labelling P1. This is depicted in Figure 3-5. Significance testing
is used to determine the probability that an arbitrarily selected labelling would fall within
distance d of one of the models. The correlated labellings reduce the number of labellings
within reach of each individual model, thus lessening the possibility that a spurious labelling
may be included in a particular model. This is depicted in Figure 3-6.
B. RANDOMIZATION TESTING


Randomization  Testing  is  a  non-parametric  statistical  test  which  can  be  employed  for 
hypothesis testing of differences between two populations. Randomization tests are also
referred to in the literature as permutation tests. [16] Randomization testing is attractive for
testing  rules  induced  from  genetic  programming  because  it  makes  no  mathematical 
assumptions  concerning  the  number  of  models  tested  or  the  correlation  of  the  predictions  of 
the  models.  As  per  Jensen  [13],  “Randomization  tests  have  several  advantages  over  other 
approaches.  They  automatically  account  for  the  number  of  observations,  the  number  of 
models  tested,  and  the  correlation  of  the  models.”  Edgington  [17]  extensively  covers  the 
applications  of  randomization  testing  to  a  variety  of  problems. 

1.  Randomization  Testing  Model 

The  randomization  test  is  a  variant  of  Fisher’s  Exact  Test  (1935).  It  is  a  rank  order 
statistic  method  based  on  a  comparison  of  the  actual  experimental  results  with  a  randomly 
derived  set  of  results.  Edgington  [17]  formally  defines  the  randomization  test  as: 

A  statistical  test  for  which  the  significance  is  determined  by  permuting  the  data 
repeatedly  to  compute  t,  F,  or  some  other  test  statistic  is  called  a  randomization  test. 

The  idea  and  methodology  behind  the  randomization  test  is  relatively  simple.  First,  a  test 
statistic  relevant  to  the  problem  is  selected.  The  test  statistic  is  then  computed  for  the  model 
based  on  the  actual  data  set.  The  data  is  permuted  in  a  random  fashion  and  the  test  statistic 
computed  for  each  permutation  of  the  data.  The  test  statistic  for  the  actual  (non-permuted) 
data  is  combined  with  the  test  statistics  from  the  random  permutations.  The  resulting 
distribution of test statistics is then placed in rank order. To reject H0, and conclude that the
sample from the actual results did not come from the sample of randomly permuted results,
the performance of the actual results must outrank the majority of the randomly permuted
results. Test statistics from the random permutations represent the best which can be
achieved by chance alone. If the test statistic from the actual data set does not outperform
the randomly derived test statistics, it can be construed as evidence to support the null
hypothesis that there are no differences between the results of the model being tested and the
results which could be expected from a random process.

a.  Randomization  Test  Procedures 

As  the  randomization  test  is  designed  to  test  the  differences  between  two 
populations,  the  hypothesis  is  stated  in  a  standard  manner  which  is  similar  to  the  notation 
used  in  conventional  two-tailed  statistics  tests.  Edgington  [17]  discusses  test  design 
implementation  of  randomization  testing.  The  null  hypothesis  is  that  the  measurement  or  set 
of  measurements  associated  with  each  experimental  unit  is  independent  of  the  assignment 
of  units  to  treatments.  In  the  case  of  rules  derived  from  genetic  programming  methods,  this 
means that the performance of a rule on some test statistic is not significantly better than
could be achieved by random chance. The steps to be taken to ensure validity of the test are as
follows: 

1. Specify the test statistic before the experiment, and ensure that the test statistic is
defined in such a way that it can be computed for a data permutation without
considering whether the data permutation represents the obtained results.

2. List every equally probable assignment of experimental units to treatments. (This
step presupposes a random assignment to treatments.)

3. Within each of the equally probable assignments, substitute for each experimental
unit the measurement (or set of measurements) for the unit to provide a distribution
of data permutations.

4. For each data permutation compute the test statistic specified in step 1.

5. Determine the proportion of the data permutations with as large a test value as the
obtained test statistic value, and use that proportion as the P-value.
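The five steps can be sketched in Python for a very small experiment in which every assignment can be listed. The example assumes two treatment groups, uses the difference in group means as the test statistic, and computes a one-sided P-value. The function name and measurements are hypothetical.

```python
from itertools import combinations


def exact_randomization_p(treatment, control):
    """One-sided P-value from listing every assignment of units to treatments."""
    data = treatment + control
    n_units, n_treated = len(data), len(treatment)

    # Step 1: test statistic, defined for any assignment (difference of means).
    def statistic(treated_idx):
        treated = [data[i] for i in treated_idx]
        others = [data[i] for i in range(n_units) if i not in treated_idx]
        return sum(treated) / len(treated) - sum(others) / len(others)

    observed = statistic(tuple(range(n_treated)))

    # Steps 2-4: list every equally probable assignment and score each one.
    all_stats = [statistic(idx) for idx in combinations(range(n_units), n_treated)]

    # Step 5: proportion of permutations with as large a value as the observed one.
    as_large = sum(1 for s in all_stats if s >= observed)
    return as_large / len(all_stats)


p_value = exact_randomization_p(treatment=[7, 8, 9], control=[1, 2, 3])
print(p_value)
```

Listing every assignment is only practical for small samples; for larger data sets a random sample of permutations is used instead.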


b.  Achieved  Significance  Level 


The  Achieved  Significance  Level  (ASL)  of  the  test  procedure  is  a  p  value 
based  on  the  proportion  of  data  permutations  with  as  large  a  test  value  as  the  actual  obtained 
test  statistic  value.  Efron  [16]  states  the  formal  definition  of  the  ASL  as 


$$ASL_{perm} = \frac{\#\{\hat{\theta}^{*}(b) \ge \hat{\theta}\}}{B} \qquad (3\text{-}1)$$

where $\hat{\theta}^{*}$ is the test statistic from the random permutations and $\hat{\theta}$ represents the test statistic
from the actual data set. B is the number of instances of test statistics in the distribution.
To reject H0, and conclude that the test statistic derived from the actual data set did not occur
as a result of a random process, $\hat{\theta}^{*}$ should equal or outrank $\hat{\theta}$ no more than 5 times for a B value of one hundred.
This would correspond to a p level of .05 for the significance of rejecting the null hypothesis.
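A minimal Python sketch of the ASL calculation is shown below. It assumes the test statistics from the permutation replications have already been collected in a list; the function name and the values are hypothetical.

```python
def achieved_significance_level(permuted_stats, observed_stat):
    """Proportion of permutation statistics at least as large as the observed statistic."""
    count = sum(1 for s in permuted_stats if s >= observed_stat)
    return count / len(permuted_stats)


# Hypothetical results: 100 replications, 5 of which match or beat the observed value.
permuted = [0.50] * 95 + [0.90] * 5
asl = achieved_significance_level(permuted, observed_stat=0.85)
print(asl)
```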

c.  Random  Data  Permutation 

Two  methods  exist  for  obtaining  the  random  permutations  of  data  used  to 
conduct a randomization test. They are referred to as systematic data permutation and random
data permutation. A systematic data permutation includes all permutations of the data in the
randomized  data  set  used  to  conduct  the  test.  A  random  data  permutation  involves  the 
sampling  of  permutations  to  obtain  a  randomized  data  set. 
A problem which could occur with the random data permutation method is that
the sampled permutations are not representative of a uniformly random distribution. This
would likely be the result of Monte Carlo error in the sampling procedure and be more
prevalent with a small number of permutation replications. Efron [16] addresses the problem
of the number of permutation replications to be used to minimize Monte Carlo error in
replication sampling. Computations were performed to find the number of permutation
replications required to make the coefficient of variation of the achieved significance level
cv(ASL) ≤ .10. Efron’s results are summarized in Table 3-1.


ASL:   .5    .25   .1    .05    .025
B:     100   299   900   1901   3894

Table 3-1. From Ref. [16]
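The entries in Table 3-1 can be checked approximately with a short Python sketch. It assumes the usual binomial approximation for the coefficient of variation of an estimated proportion, cv = sqrt((1 - ASL) / (ASL * B)); the function name is hypothetical, and the tabulated values agree with this formula only to within rounding.

```python
import math


def cv_of_asl(asl, b):
    """Binomial approximation to the coefficient of variation of the estimated ASL."""
    return math.sqrt((1.0 - asl) / (asl * b))


table_3_1 = [(0.5, 100), (0.25, 299), (0.1, 900), (0.05, 1901), (0.025, 3894)]
for asl, b in table_3_1:
    print(asl, b, round(cv_of_asl(asl, b), 4))
```

Smaller achieved significance levels require many more replications, because fewer permuted statistics fall in the tail being estimated.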

The  next  chapter  discusses  the  use  of  randomization  testing  for  assessing  the 
statistical  significance  of  rules  induced  from  genetic  programming. 




IV.  DATA  AND  METHODOLOGY 
A.  NAVAL  POSTGRADUATE  SCHOOL  GENETIC  PROGRAM 

The Naval Postgraduate School Genetic Program (NPSGP) is an adaptation of the
Simple Genetic Program in C (SGPC) written by Tackett and Carmi. NPSGP modifies SGPC
by providing functionality for data mining. Only a brief description of the preparations for
use of NPSGP is provided here; a full User’s Manual is also available. [18]

1.  Preparations  to  use  NPSGP 

a.  Data  Requirements 

The data set intended for data mining with NPSGP must be in a tab-delimited
ASCII file. Additionally, NPSGP works best if the following conditions are met:

1. The target attribute (classification) is represented by non-continuous (discrete) data.

2. The non-target attributes in the data set are represented by a mix of continuous and discrete attributes.

3. All continuous attributes are linearly scaled to the same range.

b.  Fitness  Functions 

The fitness function in NPSGP must be written in C code to correspond to
a particular rule interest measure as defined in Chapter II. For each rule evaluated, NPSGP
assigns each example in the data set to one of four possible conditions. The conditions
are displayed in Table 4-1.
                              Classification (RHS) is:
                              True      False
Description (LHS) is True     a         b
Description (LHS) is False    c         d

Table 4-1

The rule interest measure is then coded to match the components of the table. For example,
the fitness function for the certainty rule interest measure would be simply
a / (a + b).
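The partition in Table 4-1 and the certainty fitness function can be sketched as follows. NPSGP itself is written in C; this Python fragment is only an illustration, and the function names and the example data are hypothetical.

```python
def contingency_counts(lhs_matches, rhs_matches):
    """Count examples in each cell (a, b, c, d) of Table 4-1."""
    a = b = c = d = 0
    for lhs, rhs in zip(lhs_matches, rhs_matches):
        if lhs and rhs:
            a += 1      # description true, classification true
        elif lhs:
            b += 1      # description true, classification false
        elif rhs:
            c += 1      # description false, classification true
        else:
            d += 1      # description false, classification false
    return a, b, c, d


def certainty(a, b):
    """Certainty fitness a / (a + b); returns 0.0 if the rule matches no examples."""
    return a / (a + b) if (a + b) else 0.0


# Five hypothetical examples: does each match the rule's LHS and RHS?
lhs = [True, True, True, False, False]
rhs = [True, True, False, True, False]
counts = contingency_counts(lhs, rhs)
fitness = certainty(counts[0], counts[1])
print(counts, fitness)
```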


c. Description of NPSGP Rules

A typical production rule induced from a data set used in the study would
appear as:


Tree  #45 
(IF 

(IN-CTGRY 
GILL_SIZE 
N  ) 

(IN-CTGRY 
CLASSIFY 
P  ) 

) 

Number  of  Records  matched  by  LHS:  2512 
The rule was developed from a data set of mushrooms used in the study. It is read as “IF (the
example mushroom) is in category of the attribute GILL_SIZE = N, THEN the classification
of the mushroom is P (poisonous).” Information is also given on the number of examples in
the data set matching the non-target attribute description of the rule and the number of times
the rule provides incorrect classifications.

B.  DATA  SELECTION 

Two  different  data  sets  were  used  for  generation  of  rules  corresponding  to  patterns 
in  the  data  set  for  testing  purposes.  Both  data  sets  are  commonly  used  for  evaluation  of 
machine  learning  methods  by  researchers.  The  data  sets  are  available  through  anonymous  ftp 
from the University of California - Irvine (ics.uci.edu: pub/machine-learning-databases).

1.  Description  of  the  Mushroom  Data  Set 

The  mushroom  data  set  was  donated  by  Schlimmer  and  contains  8124  examples 
corresponding  to  23  species  of  gilled  mushrooms  in  the  Agaricus  and  Lepiota  families.  Each 
tuple  contains  22  nominally  valued  attributes  and  a  classification  for  the  example.  Each 
example  is  classified  as  definitely  edible,  definitely  poisonous,  or  of  unknown  edibility.  The 
latter  class  was  combined  with  the  poisonous  one.  A  full  description  of  the  data  set  and 
domain  information  is  provided  in  Appendix  A.  Ref  [18]  clearly  states  that  there  is  no 
simple  rule  for  determining  the  edibility  of  a  mushroom;  no  rule  such  as  “leaflets  three,  let 
it  be”  for  poisonous  Oak  and  Ivy. 


a.  Modifications  to  the  Mushroom  Data  Set 

Modifications  were  made  to  the  mushroom  data  set  to  improve  the 
performance of the NPSGP data mining software used in the study. Two nominally valued
attributes (Gill Spacing and Ring Number) were reformatted with continuous values over the
range of 0-100 to provide a mixture of discrete and continuous attributes to enhance
performance.  Additionally,  the  data  set  was  modified  to  test  the  ability  of  the  software  to 
find  a  rare  event.  To  facilitate  this,  a  high  quality  rule  in  the  data  set  was  identified,  (If  Odor 
=  none,  Then  mushroom  is  edible),  and  the  852  examples  supporting  this  rule  were  changed 
from  a  classification  of  “e”  to  a  classification  of  “z”.  The  actual  rules  tested  for  significance 
from  this  data  set  were  from  a  related  study  at  the  Naval  Postgraduate  School.  [19] 

2.  Description  of  the  Zoo  Data  Set 

The  zoo  data  set  was  donated  by  Forsyth  and  contains  101  examples  of  zoo  animals 
which  are  divided  into  seven  distinct  classes.  The  classes  are  standard  taxonomic  categories 
such  as  mammals,  reptiles,  birds,  etc.  All  non-target  attributes  are  boolean  with  the 
exception  of  one  continuously  valued  attribute.  The  details  of  the  attribute  names  and  values 
are provided in Appendix B. The actual rules tested for significance from this data set were
from  previous  research  at  the  Naval  Postgraduate  School.  [20] 

C.  RANDOMIZATION  TESTING  DESIGN 

The  design  of  the  randomization  testing  to  test  the  significance  of  the  rules  induced 
from  NPSGP  involved  the  following  steps: 

1. Statement of the hypothesis.

2. Computation of the test statistic of the NPSGP rules on the actual data set. For
the purposes of the study, the test statistic used was the confidence of the rule.

3. Randomization of the class labels in the data set.

4. Computation of the test statistic of the NPSGP rules on the randomized data set.

5. Repetition of steps 3 and 4 for the required number of permutation replications.

6. Computation of the achieved significance level for each rule being tested.

7. Evaluation of the stated hypothesis.

1.  Statement  of  Hypothesis 

The null hypothesis is that rules induced by NPSGP are not more accurate than
rules induced by random processes, or alternatively stated, there is no difference between the
populations. The alternative hypothesis is that there exists a significant difference in the
accuracy of rules induced by NPSGP as compared to the performance of rules induced by
random processes. The hypothesis is formally stated as:

H0: F1 = F2
Ha: F1 ≠ F2

where F1 represents an NPSGP induced rule and F2 represents a rule that is formed as a result
of a random process.

2.  Randomization  of  the  Data  Set 

The  randomization  test  process  requires  the  random  assignment  of  class  labels  to  the 
examples  in  the  data  set  for  each  permutation  replication.  In  the  study,  the  classification 
labels  were  randomly  assigned  in  the  same  proportions  as  they  originally  occurred  in  the  data 
set,  e.g.,  for  the  3916  occurrences  of  the  poisonous  classification  in  the  mushroom  data  set 
were  randomly  assigned  to  3916  examples  in  the  data  set.  This  method  involves  sampling 
the total number of permutations of class labels. For the mushroom data set of 8124
examples, there are 24,372 (8124 x 3) permutations without replication, i.e., each example in the data
set is assigned all three of the possible class labels, with one of the labels being the correct one.
The sampling was done to preserve the continuity of results occurring under the
randomization test. If by chance the randomization process happened to correctly assign the
3916 poisonous class labels to the corresponding correct examples in the data set, then an
equivalent test statistic should be computed for the actual and random results.
Randomization of class labels was accomplished by a program written in C++ for the two
specific data sets used in the study.
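
The label randomization described above can be illustrated with a minimal sketch. This is not the program used in the study; the function name `randomize_labels` and the use of `std::shuffle` with a `std::mt19937` generator are assumptions chosen for illustration. Shuffling the existing label column, rather than drawing new labels, is one simple way to keep each class at its original count (e.g., 3916 poisonous labels remain 3916).

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Hypothetical helper: returns a copy of `labels` with the class labels
// randomly reassigned to examples. Because the labels are shuffled rather
// than redrawn, every class keeps exactly its original count.
std::vector<char> randomize_labels(const std::vector<char>& labels,
                                   std::mt19937& rng) {
    assert(!labels.empty());  // an empty data set has nothing to randomize
    std::vector<char> shuffled(labels);
    std::shuffle(shuffled.begin(), shuffled.end(), rng);
    return shuffled;
}
```

A caller would read the class column of the data set into a vector, call `randomize_labels` once per permutation replication, and score the rules against the returned labels.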

3.  Generation  of  Permutation  Replications 

In a permutation replication, each NPSGP rule tested was scored on its performance
in classifying the relevant data set with randomly assigned labels as previously described.
The  permutation  replications  were  generated  by  a  procedure  which  was  executed  in  a  Unix 
C  shell  script.  The  procedure  is  represented  in  pseudocode: 

For j = 1 to (number of replications) Loop
    Randomize class labels in the data set
    Compute confidence statistic for each rule tested
    Record results for permutation replication on randomized data set
End Loop
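
The loop above can be sketched in C++ as follows. This is an illustrative sketch only, not the shell script or utility program used in the study; the `Rule` struct, the `confidence` function, and `permutation_replications` are hypothetical names, and a rule is assumed to be representable as a left-hand-side test on a row index plus a predicted class.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Hypothetical in-memory form of one rule: a left-hand-side test on a row
// index plus the class the rule predicts.
struct Rule {
    std::function<bool(std::size_t)> lhs_matches;
    char predicted_class;
};

// Confidence = (LHS matches carrying the predicted class) / (LHS matches).
double confidence(const Rule& rule, const std::vector<char>& labels) {
    std::size_t matched = 0, correct = 0;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        if (rule.lhs_matches(i)) {
            ++matched;
            if (labels[i] == rule.predicted_class) ++correct;
        }
    }
    return matched == 0 ? 0.0 : static_cast<double>(correct) / matched;
}

// Runs `replications` permutation replications and returns, for each rule,
// the confidence scores obtained on the randomized labels.
std::vector<std::vector<double>> permutation_replications(
        const std::vector<Rule>& rules, std::vector<char> labels,
        int replications, std::mt19937& rng) {
    assert(replications > 0);
    std::vector<std::vector<double>> scores(rules.size());
    for (int j = 0; j < replications; ++j) {
        std::shuffle(labels.begin(), labels.end(), rng);  // randomize class labels
        for (std::size_t r = 0; r < rules.size(); ++r)     // score every rule
            scores[r].push_back(confidence(rules[r], labels));
    }
    return scores;
}
```

Each inner vector of the result holds one rule's scores across all replications, which is the input needed to compute that rule's achieved significance level.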

A utility program was developed in C++ to grade the confidence of each NPSGP rule on the
permutation replications. As per Efron [16], 2000 permutation replications were selected
for each rule to keep the coefficient of variation of the achieved significance level at less
than or equal to 0.10.
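
The replication count can be related to the stated bound with a short calculation. This is a sketch under the assumption (not stated in the text) that the replications are independent, so that the estimated achieved significance level behaves as a binomial proportion. The symbols $B$ (number of replications) and $A$ (true achieved significance level) are introduced here for illustration only.

```latex
\mathrm{cv}_B(\widehat{A}) = \sqrt{\frac{1 - A}{B\,A}},
\qquad
B = 2000,\; A = 0.05:\quad
\sqrt{\frac{0.95}{2000 \times 0.05}} = \sqrt{0.0095} \approx 0.097 \le 0.10
```

Under this approximation the bound holds for achieved significance levels of roughly 0.048 or more; smaller levels would need more replications to reach the same coefficient of variation.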
4. Compute Achieved Significance Level

The  computed  test  statistics  for  the  NPSGP  rules  on  the  actual  data  set  and  on  the 
permutation replications are arranged in rank order. The achieved significance level is
computed by counting the number of times the test statistic score from a permutation
replication exceeds the test statistic score achieved on the actual data set and dividing by the total
number of test statistics.
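
The computation can be sketched as below. This is a hedged illustration, not the study's code: the function names are hypothetical, the denominator is assumed to be the number of permutation replications, and the comparison is written as "greater than or equal to" to match the ">=" row of Table 5-1, whereas the prose above says "exceeds".

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Achieved significance level (ASL): the fraction of permutation replications
// whose test statistic is at least as large as the statistic on the actual data.
double achieved_significance_level(double actual_stat,
                                   const std::vector<double>& replicated_stats) {
    assert(!replicated_stats.empty());
    const auto at_least = std::count_if(
        replicated_stats.begin(), replicated_stats.end(),
        [actual_stat](double s) { return s >= actual_stat; });
    return static_cast<double>(at_least) / replicated_stats.size();
}

// Reject the null hypothesis when the ASL falls below the significance level.
bool reject_null(double asl, double alpha) { return asl < alpha; }
```

With 2000 replications and no replicated score reaching the actual confidence, the function returns 0, which is below both the .05 and .01 significance levels.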
V. ANALYSIS OF RESULTS


This  chapter  presents  the  results  produced  by  the  methodology  described  in  the 
previous  chapter.  Three  different  sets  of  rules  meeting  the  criteria  of  being  technically 
interesting as defined in Chapter II were tested for significance using the randomization test
methodology  described  in  the  previous  chapter.  Two  of  the  sets  of  rules  tested  were  from 
the  mushroom  data  set  and  the  other  set  of  rules  from  the  zoo  data  set  previously  described. 
A  sample  result  of  the  testing  method  is  shown  by  the  example  in  Table  5-1.  The 
significance  level  of  the  test  in  Table  5-1  is  .05,  but  given  the  results  of  the  test,  the  same 
conclusion  with  respect  to  the  null  hypothesis  would  have  been  reached  for  a  significance 
level  of  .01. 

A.  SUMMARY  OF  RESULTS 

1.  Testing  on  Rules  from  the  Mushroom  Data  Set 

The first set of rules tested was from the mushroom data set after the NPSGP
program was allowed to run for 30 generations. The twelve rules in the set were induced
using the confidence fitness function. A summary of the rules' performance on the actual
data set is presented in Table 5-2. The exact syntax of the rules and the C++ utility program
used to test the rules are included in Appendix C. The scoring statistic for the 2000
randomization permutation replications is the confidence of the rule on each permutation
replication. The scoring summary of the results of the 2000 permutation replications on the
twelve rules is presented in Table 5-3.

The accuracy of the rules on the randomized data set fell between 0.40 and 0.50 in the
large majority of replications (Table 5-3). None of the scores of the permutation replications
on the confidence test statistic exceeded any of the results obtained on the actual data set.
Therefore, the null hypothesis that the rules induced under genetic programming do not differ
from those randomly induced is rejected. The consistency in the results of the permutation
replications is most likely due to the large number of examples (8124) and the distribution
of the small number of classes (3) in the data set.


                                           Sample 1                  Sample 2

Rule                                       If Bruises = True,        If Odor = Foul or Gill Spacing
                                           Then Edible               >= 5.93 and Gill Spacing <=
                                                                     62.05, Then Edible
Actual # of Tuples in Data Set             8124                      8124
Actual # Left Hand Side (Attributes)
  matched in Data Set                      3376                      6812
Actual # Right Hand Side (Classification)
  matched in Data Set                      2264                      2430
Actual # of Misclassified Records          1112                      4382
Confidence of the Rule (%)                 67.0 %                    35.6 %
Actual # of Randomization Trials >=
  Confidence of the Rule                   0                         2000
Actual # of Randomization Trials <
  Confidence of the Rule                   2000                      0
Hypothesis Conclusion at
  Significance Level p = .05               Reject null hypothesis    Fail to reject null hypothesis

Table 5-1
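
The confidence figures in the tables follow from the two counts reported for each rule. As a brief illustration (the function name `confidence_percent` is hypothetical, and the formula is the one used by the `output` routine in the appendices), Sample 1 gives 1 - 1112/3376, or about 67 %.

```cpp
#include <cassert>
#include <cmath>

// Confidence of a rule, as a percentage, from the two counts reported in the
// tables: examples matching the left-hand side and, of those, the number
// misclassified.
double confidence_percent(int lhs_matched, int misclassified) {
    assert(lhs_matched > 0 && misclassified >= 0 && misclassified <= lhs_matched);
    return 100.0 * (1.0 - static_cast<double>(misclassified) / lhs_matched);
}
```

Calling `confidence_percent(3376, 1112)` reproduces the Sample 1 entry, and `confidence_percent(6812, 4382)` the Sample 2 entry, up to the rounding or truncation applied when the tables were printed.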


Rule Number    LHS matched    # misclassified    Confidence (%)

     1            2372              144               93.9
     2             556               44               92.0
     3            2512              288
     4            3188              808               74.6
     5             828              228               72.4
     6            2480              720               70.9
     7            2304              144               93.7
     8             512               64               87.5
     9            1872              576               69.2
    10            1024              120
    11            3376             1112               67.0
    12            1640              288               82.4

Table 5-2


Rule Number    Count >= .50    Count < .50 and >= .40    Count < .40

     1              95                 1905                    0
     2               0                 1830                  170
     3              30                 1970                    0
     4              21                 1979                    0
     5             371                 1629                    0
     6              73                 1927                    0
     7             110                 1890                    0
     8             509                 1491                    0
     9             130                 1870                    0
    10             190                 1810                    0
    11               0                 1979                   21
    12              75                 1925                    0

Table 5-3
The second set of rules tested was also from the mushroom data set. However, this
set of eight rules was selected from the first generation of rules produced from the NPSGP
program and is generally of lower quality on the confidence fitness function. The
performance of this set of rules on the actual data set is presented in Table 5-4. The exact
syntax of the rules and the C++ program used to test the rules are included in Appendix D.


# LHS matched    # misclassified    Confidence (%)

     1072              400               62.7
     1872              576               69.2
     1202              410               65.9
     1048              291               72.2
     5612             2505               55.4
     7004             3344               52.3
     4936             2232               54.8
     6476             3369               48.0

Table 5-4

Table  5-5  lists  the  results  obtained  from  2000  permutation  replications. 


Rule Number    Count >= .50    Count < .50 and >= .40    Count < .40

     1             120                 1880                    0
     2             140                 1860                    0
     3               0                 1739                  261
     4               0                 1748                  252
     5               0                 2000                    0
     6               0                 1998                    2
     7               0                 1997                    3
     8               0                 1998                    2

Table 5-5


Again,  no  permutation  replications  had  test  statistics  exceeding  those  obtained  on  the  actual 
data  set;  as  per  the  test  criteria,  the  null  hypothesis  was  rejected  for  all  eight  rules  in  this 
sample. 

2.  Testing  of  Rules  from  the  Zoo  Data  Set 

The third set of rules tested for significance was from the zoo data set. The rules
selected were technically interesting rules induced from the data set which were of high
quality based on the confidence fitness function. The performance of the rules tested on the
actual data set is presented in Table 5-6. The actual syntax of the rules and the C++ program
used to test the rules are included as Appendix E.


Table  5-6 

The  performance  of  the  rule  set  on  2000  permutation  replications  is  presented  in  Table  5-7. 
No  scoring  statistic  from  any  one  permutation  replication  exceeded  the  score  of  any  rule  on 
the  actual  data  set.  Again,  the  null  hypothesis  was  rejected  for  the  eight  rules.  The  scoring 
statistics  on  the  permutation  replications  exhibited  more  variation  than  those  on  the 
mushroom  data  set.  It  is  speculated  that  this  is  due  to  the  smaller  number  of  examples  in  the 
zoo data set as compared to the mushroom data set (101 vs. 8124) and the larger number of
classes (7 vs. 3) present.


B.  HYPOTHESIS  CONCLUSION 

The  null  hypothesis,  that  the  rules  induced  by  NPSGP  perform  no  differently  from 
rules  resulting  from  random  processes,  was  rejected  in  all  cases.  In  no  cases  did  test  statistics 
from  permutation  replications  exceed  those  obtained  from  the  actual  data  set.  Overall, 
52,000  permutation  replications  were  generated  to  test  the  significance  of  the  26  rules 
selected for testing. In the vast majority of cases, the confidence test statistics for the
permutation replications were significantly lower than those of the actual test results.
VI. CONCLUSIONS AND RECOMMENDATIONS

A.  CONCLUSIONS 

The  major  objective  of  this  study  was  to  examine  the  problems  and  issues  associated 
with  statistical  hypothesis  testing  of  rules  induced  from  data  mining  tools.  Data  mining  tools 
will  increasingly  play  a  large  part  in  analyzing  the  enormous  amount  of  data  stored  within 
DOD. Data mining tools such as NPSGP can produce potentially enormous numbers of
rules, which increases the chance that a rule based on random variation in the data will be
accepted  as  a  valid  rule.  Therefore,  it  is  imperative  that  some  means  be  available  for  the 
testing  of  induced  rules. 

Randomization testing represents an attractive testing method for large numbers of
rules based on its freedom from the assumptions made in conventional parametric tests. It
produces accurate results when the methodology is correctly applied. Thus,
it is recommended for use in circumstances where the statistical validity of the rule in
question must be established to a high degree of accuracy.

All of the twenty-six rules tested which were induced from NPSGP outscored the test
statistics on the confidence fitness measure produced by two thousand permutation
replications. The null hypothesis that NPSGP rules perform no better than rules produced
by a random process was rejected for all rules. The empirical results obtained from testing
the NPSGP rules showed that NPSGP clearly produces rules which are better than those
produced by a random process.
The empirical results also indicated, as expected, that data sets with a larger number
of  classes  and  a  small  number  of  examples  in  the  data  set  will  experience  more  variation  in 
the  scoring  on  the  confidence  fitness  measure.  Data  sets  with  the  above  mentioned 
characteristics  will  most  likely  have  more  chance  for  random  variations  in  the  data  which 
could  appear  as  valid  patterns  in  the  data  set. 

B.  RECOMMENDATIONS 

More research needs to be conducted on real life data sets to determine if statistical
hypothesis testing is warranted for all data sets, especially for data sets with small numbers
of classes. If the existing rule induction methods always perform better than can be expected
by chance, there is no need to conduct the testing.

One of the problems encountered during the study was the computational cost of
conducting the randomization testing. For example, the permutation
replications for the testing of eight rules in the study took approximately six hours on a Sun
SPARC-10 workstation. Additional research could be conducted to optimize methods to
conduct the testing so that running times are lessened.
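
One possible direction, offered here as an assumption rather than something the study implemented or measured: because a permutation replication changes only the class labels, the set of examples matching a rule's left-hand side is the same in every replication and could be computed once. The sketch below illustrates the idea; the function name `replicate_with_cached_lhs` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Scores one rule on `replications` label permutations using a precomputed
// list of the row indices that satisfy the rule's left-hand side.
std::vector<double> replicate_with_cached_lhs(
        const std::vector<std::size_t>& lhs_rows, char predicted_class,
        std::vector<char> labels, int replications, std::mt19937& rng) {
    assert(!lhs_rows.empty());
    std::vector<double> scores;
    scores.reserve(replications);
    for (int j = 0; j < replications; ++j) {
        std::shuffle(labels.begin(), labels.end(), rng);
        std::size_t correct = 0;
        for (std::size_t row : lhs_rows)  // only the matched rows are visited
            if (labels[row] == predicted_class) ++correct;
        scores.push_back(static_cast<double>(correct) / lhs_rows.size());
    }
    return scores;
}
```

Whether this would shorten the running times reported above depends on where the original procedure spent its time (for example, re-reading the data file on each replication), which the text does not say.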
APPENDIX A. MUSHROOM DATA SET


Mushroom  Data  Set  Information 

Sources: 

(a) Mushroom records drawn from The Audubon Society Field Guide to North
American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf

(b) Donor to UC Irvine: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
Date: 27 April 1987

Past  Usage: 

1. Schlimmer, J.S. (1987). Concept Acquisition Through Representational
Adjustment (Technical Report 87-19). Doctoral dissertation, Department
of Information and Computer Science, University of California, Irvine.

-- STAGGER: asymptoted to 95% classification accuracy after reviewing
1000 instances.

2. Iba, W., Wogulis, J., & Langley, P. (1988). Trading off Simplicity
and Coverage in Incremental Concept Learning. In Proceedings of
the 5th International Conference on Machine Learning, 73-79.
Ann Arbor, Michigan: Morgan Kaufmann.

-- approximately the same results with their HILLARY algorithm

Relevant  Information: 

This data set includes descriptions of hypothetical samples
corresponding to 23 species of gilled mushrooms in the Agaricus and
Lepiota Family (pp. 500-525). Each species is identified as
definitely edible, definitely poisonous, or of unknown edibility and
not recommended. This latter class was combined with the poisonous
one. The Guide clearly states that there is no simple rule for
determining the edibility of a mushroom; no rule like "leaflets
three, let it be" for Poisonous Oak and Ivy.

Mushroom  Data  Set  Attributes: 

Classes:  Edible,  Poisonous,  Z 

Cap Shape: Bell, Conical, Convex, Flat, Knobbed, Sunken

Cap Surface: Fibrous, Grooved, Scaly, Smooth

Cap Color: Brown, Buff, Cinnamon, Gray, Green, Pink, Purple, Red, White, Yellow

Bruises: True, False

Odor: Almond, Anise, Creosote, Foul, Musty, Pungent, None

Gill Attachment: Attached, Descending, Free, Notched

Gill Spacing: Close (0), Crowded (50), Distant (100)

Gill Size: Broad, Narrow

Gill Color: Black, Brown, Buff, Chocolate, Gray, Green, Orange, Pink, Purple, Red,
White, Yellow

Stalk Shape: Enlarging, Tapering

Stalk Root: Bulbous, Club, Equal, Rhizomorphs, Rooted, Missing

Stalk Surface Above Ring: Fibrous, Scaly, Silky, Smooth

Stalk Surface Below Ring: Fibrous, Scaly, Silky, Smooth

Stalk Color Above Ring: Brown, Buff, Cinnamon, Gray, Orange, Pink, Red

Stalk Color Below Ring: Brown, Buff, Cinnamon, Gray, Orange, Pink, Red, White,
Yellow

Veil Type: Partial, Universal

Veil Color: Brown, Orange, White, Yellow

Ring Number: None (0), One (50), Two (100)

Ring Type: Cobwebby, Evanescent, Flaring, Large, Pendant, Sheathing, Zone, None

Spore Print Color: Black, Brown, Buff, Chocolate, Green, Orange, Purple, White,
Yellow

Population: Abundant, Clustered, Numerous, Scattered, Several, Solitary

Habitat: Grasses, Leaves, Meadows, Paths, Urban, Waste, Woods

Examples:  8124 

Distribution  of  Classes:  Edible  (3356),  Poisonous  (3916),  Z  (852) 
APPENDIX B. ZOO DATA SET


Relevant Information: A simple database containing 17 boolean-valued attributes. The
“type” attribute is the class attribute which corresponds to a classification of the animals.

Class Number / Set of Animals:

1. (41) aardvark, antelope, bear, boar, buffalo, calf, cavy, cheetah, deer, dolphin, elephant,
fruitbat, giraffe, girl, goat, gorilla, hamster, hare, leopard, lion, lynx, mink, mole,
mongoose, opossum, oryx, platypus, polecat, pony, porpoise, puma, pussycat, raccoon,
reindeer, seal, sealion, squirrel, vampire, vole, wallaby, wolf

2.  (20)  chicken,  crow,  dove,  duck,  flamingo,  gull,  hawk,  kiwi,  lark,  ostrich,  parakeet, 
penguin,  pheasant,  rhea,  skimmer,  skua,  sparrow,  swan,  vulture,  wren 

3.  (5)  pitviper,  sea  snake,  slowworm,  tortoise,  tuatara 

4.  (13)  bass,  carp,  catfish,  chub,  dogfish,  haddock,  herring,  pike,  piranha,  seahorse,  sole, 
stingray,  tuna 

5.  (4)  frog,  frog,  newt,  toad 

6.  (8)  flea,  gnat,  honeybee,  housefly,  ladybird,  moth,  termite,  wasp 

7.  (10)  clam,  crayfish,  octopus,  scorpion,  seawasp,  slug,  starfish,  worm 
Attribute Information:

Number of Attributes: 18 (animal name, 15 boolean attributes, 2 numeric)

Attributes:

 1. Animal Name:  Unique for each instance
 2. hair:         Boolean
 3. feathers:     Boolean
 4. eggs:         Boolean
 5. milk:         Boolean
 6. airborne:     Boolean
 7. aquatic:      Boolean
 8. predator:     Boolean
 9. toothed:      Boolean
10. backbone:     Boolean
11. breathes:     Boolean
12. venomous:     Boolean
13. fins:         Boolean
14. legs:         Numeric (set of values: 0,2,4,6,8)
15. tail:         Boolean
16. domestic:     Boolean
17. catsize:      Boolean
18. type:         Numeric (integer values in range [1..7])


Number  of  Examples:  101 
APPENDIX C. RULES TESTING PROGRAM 1 (MUSHROOM)


#include <iostream>
#include <fstream>
#include <iomanip>
#include <cstdlib>

using namespace std;

// Per-rule counters: examples matching the rule's left-hand side (LHS) and,
// of those, the examples whose class differs from the rule's prediction.
int rule1_LHS_count = 0;
int rule1_miss_records = 0;
int rule2_LHS_count = 0;
int rule2_miss_records = 0;
int rule3_LHS_count = 0;
int rule3_miss_records = 0;
int rule4_LHS_count = 0;
int rule4_miss_records = 0;
int rule5_LHS_count = 0;
int rule5_miss_records = 0;
int rule6_LHS_count = 0;
int rule6_miss_records = 0;
int rule7_LHS_count = 0;
int rule7_miss_records = 0;
int rule8_LHS_count = 0;
int rule8_miss_records = 0;
int rule9_LHS_count = 0;
int rule9_miss_records = 0;
int rule10_LHS_count = 0;
int rule10_miss_records = 0;
int rule11_LHS_count = 0;
int rule11_miss_records = 0;
int rule12_LHS_count = 0;
int rule12_miss_records = 0;


char rule_name1[] = "Rule 1";
char rule_name2[] = "Rule 2";
char rule_name3[] = "Rule 3";
char rule_name4[] = "Rule 4";
char rule_name5[] = "Rule 5";
char rule_name6[] = "Rule 6";
char rule_name7[] = "Rule 7";
char rule_name8[] = "Rule 8";
char rule_name9[] = "Rule 9";
char rule_name10[] = "Rule 10";
char rule_name11[] = "Rule 11";
char rule_name12[] = "Rule 12";


int records_read = 0;

void output(char [], int, int); // function prototype

int main()
{
    ifstream inClientFile("mushroom_spa", ios::in);
    if (!inClientFile) {
        cerr << "File could not be opened " << endl;
        exit(1);
    }


    // One variable per field of a mushroom record, in file order.
    char classification;
    char cap_shape;
    char cap_surface;
    char cap_color;
    char bruises;
    char odor;
    char gill_attach;
    int gill_spacing;
    char gill_size;
    char gill_color;
    char stalk_shape;
    char stalk_root;
    char stalk_surf_abv_ring;
    char stalk_surf_blw_ring;
    char stalk_color_abv_ring;
    char stalk_color_blw_ring;
    char veil_type;
    char veil_color;
    int ring_number;
    char ring_type;
    char spore_print_color;
    char population;
    char habitat;


    while (inClientFile >> classification >> cap_shape >> cap_surface >> cap_color
           >> bruises >> odor >> gill_attach >> gill_spacing >> gill_size >> gill_color
           >> stalk_shape >> stalk_root >> stalk_surf_abv_ring >> stalk_surf_blw_ring
           >> stalk_color_abv_ring >> stalk_color_blw_ring >> veil_type >> veil_color
           >> ring_number >> ring_type >> spore_print_color >> population >> habitat)
    {
        records_read = records_read + 1;

        // test rule 1
        if (stalk_surf_abv_ring == 'k')
            { rule1_LHS_count = rule1_LHS_count + 1; }
        if (stalk_surf_abv_ring == 'k' && classification != 'p')
            { rule1_miss_records = rule1_miss_records + 1; }

        // test rule 2
        if (stalk_root == 'c')
            { rule2_LHS_count = rule2_LHS_count + 1; }
        if (stalk_root == 'c' && classification != 'e')
            { rule2_miss_records = rule2_miss_records + 1; }

        // test rule 3
        if (gill_size == 'n')
            { rule3_LHS_count = rule3_LHS_count + 1; }
        if (gill_size == 'n' && classification != 'p')
            { rule3_miss_records = rule3_miss_records + 1; }

        // test rule 4
        if (stalk_surf_blw_ring != 's')
            { rule4_LHS_count = rule4_LHS_count + 1; }
        if (stalk_surf_blw_ring != 's' && classification != 'p')
            { rule4_miss_records = rule4_miss_records + 1; }

        // test rule 5
        if (cap_shape == 'k')
            { rule5_LHS_count = rule5_LHS_count + 1; }
        if (cap_shape == 'k' && classification != 'p')
            { rule5_miss_records = rule5_miss_records + 1; }

        // test rule 6
        if (stalk_root == 'x')
            { rule6_LHS_count = rule6_LHS_count + 1; }
        if (stalk_root == 'x' && classification != 'p')
            { rule6_miss_records = rule6_miss_records + 1; }

        // test rule 7
        if (stalk_surf_blw_ring == 'k')
            { rule7_LHS_count = rule7_LHS_count + 1; }
        if (stalk_surf_blw_ring == 'k' && classification != 'p')
            { rule7_miss_records = rule7_miss_records + 1; }

        // test rule 8
        if (stalk_color_blw_ring == 'n')
            { rule8_LHS_count = rule8_LHS_count + 1; }
        if (stalk_color_blw_ring == 'n' && classification != 'p')
            { rule8_miss_records = rule8_miss_records + 1; }

        // test rule 9
        if (stalk_color_blw_ring == 'p' && !(ring_number >= 9.76 &&
            ring_number <= 11.638))
            { rule9_LHS_count = rule9_LHS_count + 1; }
        if (stalk_color_blw_ring == 'p' && !(ring_number >= 9.76 &&
            ring_number <= 11.638) && (classification != 'p'))
            { rule9_miss_records = rule9_miss_records + 1; }

        // test rule 10
        if (gill_size == 'n' && cap_shape == 'x')
            { rule10_LHS_count = rule10_LHS_count + 1; }
        if (gill_size == 'n' && cap_shape == 'x' && classification != 'p')
            { rule10_miss_records = rule10_miss_records + 1; }

        // test rule 11
        if (bruises == 't')
            { rule11_LHS_count = rule11_LHS_count + 1; }
        if (bruises == 't' && classification != 'e')
            { rule11_miss_records = rule11_miss_records + 1; }

        // test rule 12
        if (gill_size == 'n' && stalk_color_abv_ring == 'w')
            { rule12_LHS_count = rule12_LHS_count + 1; }
        if (gill_size == 'n' && stalk_color_abv_ring == 'w' &&
            classification != 'p')
            { rule12_miss_records = rule12_miss_records + 1; }
    }

    {
        cout << "Number of records read in: " << records_read << endl;
        cout << endl;

        output(rule_name1, rule1_LHS_count, rule1_miss_records);
        output(rule_name2, rule2_LHS_count, rule2_miss_records);
        output(rule_name3, rule3_LHS_count, rule3_miss_records);
        output(rule_name4, rule4_LHS_count, rule4_miss_records);
        output(rule_name5, rule5_LHS_count, rule5_miss_records);
        output(rule_name6, rule6_LHS_count, rule6_miss_records);
        output(rule_name7, rule7_LHS_count, rule7_miss_records);
        output(rule_name8, rule8_LHS_count, rule8_miss_records);
        output(rule_name9, rule9_LHS_count, rule9_miss_records);
        output(rule_name10, rule10_LHS_count, rule10_miss_records);
        output(rule_name11, rule11_LHS_count, rule11_miss_records);
        output(rule_name12, rule12_LHS_count, rule12_miss_records);
    }

    return 0;
}


////////////////////////////////////////////////////////////////////////////////
// Prints a rule's name, LHS match count, misclassified count, and confidence.
void output(char rule_id[], int LHS_matches, int misclassed_records)
{
    float rule_conf = 1.0 - ((float) misclassed_records / LHS_matches);

    cout << setiosflags(ios::left) << rule_id << " " << setw(5) << LHS_matches
         << " " << setw(5) << misclassed_records << " " << setprecision(3)
         << rule_conf << endl;
}
APPENDIX D. RULES TESTING PROGRAM 2 (MUSHROOM)


#include <iostream>
#include <fstream>
#include <iomanip>
#include <cstdlib>

using namespace std;

// Per-rule counters: examples matching the rule's left-hand side (LHS) and,
// of those, the examples whose class differs from the rule's prediction.
int rule1_LHS_count = 0;
int rule1_miss_records = 0;
int rule2_LHS_count = 0;
int rule2_miss_records = 0;
int rule3_LHS_count = 0;
int rule3_miss_records = 0;
int rule4_LHS_count = 0;
int rule4_miss_records = 0;
int rule5_LHS_count = 0;
int rule5_miss_records = 0;
int rule6_LHS_count = 0;
int rule6_miss_records = 0;
int rule7_LHS_count = 0;
int rule7_miss_records = 0;
int rule8_LHS_count = 0;
int rule8_miss_records = 0;


char rule_name1[] = "Rule 1";
char rule_name2[] = "Rule 2";
char rule_name3[] = "Rule 3";
char rule_name4[] = "Rule 4";
char rule_name5[] = "Rule 5";
char rule_name6[] = "Rule 6";
char rule_name7[] = "Rule 7";
char rule_name8[] = "Rule 8";

int records_read = 0;

void output(char [], int, int); // function prototype

int main()
{
    ifstream inClientFile("combo_file", ios::in);
    if (!inClientFile) {
        cerr << "File could not be opened " << endl;
        exit(1);
    }


    // One variable per field of a mushroom record, in file order.
    char classification;
    char cap_shape;
    char cap_surface;
    char cap_color;
    char bruises;
    char odor;
    char gill_attach;
    int gill_spacing;
    char gill_size;
    char gill_color;
    char stalk_shape;
    char stalk_root;
    char stalk_surf_abv_ring;
    char stalk_surf_blw_ring;
    char stalk_color_abv_ring;
    char stalk_color_blw_ring;
    char veil_type;
    char veil_color;
    int ring_number;
    char ring_type;
    char spore_print_color;
    char population;
    char habitat;


    while (inClientFile >> classification >> cap_shape >> cap_surface >> cap_color
           >> bruises >> odor >> gill_attach >> gill_spacing >> gill_size >> gill_color
           >> stalk_shape >> stalk_root >> stalk_surf_abv_ring >> stalk_surf_blw_ring
           >> stalk_color_abv_ring >> stalk_color_blw_ring >> veil_type >> veil_color
           >> ring_number >> ring_type >> spore_print_color >> population >> habitat)
    {
        records_read = records_read + 1;

        // test rule 1
        if (cap_color == 'y')
            { rule1_LHS_count = rule1_LHS_count + 1; }
        if (cap_color == 'y' && classification != 'p')
            { rule1_miss_records = rule1_miss_records + 1; }

        // test rule 2
        if (stalk_color_abv_ring == 'p')
            { rule2_LHS_count = rule2_LHS_count + 1; }
        if (stalk_color_abv_ring == 'p' && classification != 'p')
            { rule2_miss_records = rule2_miss_records + 1; }

        // test rule 3
        if (gill_color == 'w')
            { rule3_LHS_count = rule3_LHS_count + 1; }
        if (gill_color == 'w' && classification != 'e')
            { rule3_miss_records = rule3_miss_records + 1; }

        // test rule 4
        if (gill_color == 'n')
            { rule4_LHS_count = rule4_LHS_count + 1; }
        if (gill_color == 'n' && classification != 'e')
            { rule4_miss_records = rule4_miss_records + 1; }

        // test rule 5
        if (gill_size == 'b')
            { rule5_LHS_count = rule5_LHS_count + 1; }
        if (gill_size == 'b' && classification != 'e')
            { rule5_miss_records = rule5_miss_records + 1; }

        // test rule 6
        if (stalk_root != 'e')
            { rule6_LHS_count = rule6_LHS_count + 1; }
        if (stalk_root != 'e' && classification != 'p')
            { rule6_miss_records = rule6_miss_records + 1; }

        // test rule 7
        if (stalk_surf_blw_ring == 's')
            { rule7_LHS_count = rule7_LHS_count + 1; }
        if (stalk_surf_blw_ring == 's' && classification != 'e')
            { rule7_miss_records = rule7_miss_records + 1; }

        // test rule 8
        if ((stalk_color_blw_ring == 'p') || (gill_size == 'b'))
            { rule8_LHS_count = rule8_LHS_count + 1; }
        if (((stalk_color_blw_ring == 'p') || (gill_size == 'b')) &&
            (classification != 'e'))
            { rule8_miss_records = rule8_miss_records + 1; }
    }

    // end of testing conditions block


    {
        cout << "Number of records read in: " << records_read << endl;
        output(rule_name1, rule1_LHS_count, rule1_miss_records);
        output(rule_name2, rule2_LHS_count, rule2_miss_records);
        output(rule_name3, rule3_LHS_count, rule3_miss_records);
        output(rule_name4, rule4_LHS_count, rule4_miss_records);
        output(rule_name5, rule5_LHS_count, rule5_miss_records);
        output(rule_name6, rule6_LHS_count, rule6_miss_records);
        output(rule_name7, rule7_LHS_count, rule7_miss_records);
        output(rule_name8, rule8_LHS_count, rule8_miss_records);
    }

    return 0;
}


////////////////////////////////////////////////////////////////////////////////
// Prints a rule's name, LHS match count, misclassified count, and confidence.
void output(char rule_id[], int LHS_matches, int misclassed_records)
{
    float rule_conf = 1.0 - ((float) misclassed_records / LHS_matches);

    cout << setiosflags(ios::left) << rule_id << " " << setw(5)
         << LHS_matches << " " << setw(5) << misclassed_records << " "
         << setprecision(3) << rule_conf << endl;
}
APPENDIX E. RULES TESTING PROGRAM (ZOO)


#include <iostream>
#include <fstream>
#include <iomanip>
#include <cstdlib>

using namespace std;

// Per-rule counters: examples matching the rule's left-hand side (LHS) and,
// of those, the examples whose class differs from the rule's prediction.
int rule1_LHS_count = 0;
int rule1_miss_records = 0;
int rule2_LHS_count = 0;
int rule2_miss_records = 0;
int rule3_LHS_count = 0;
int rule3_miss_records = 0;
int rule4_LHS_count = 0;
int rule4_miss_records = 0;
int rule5_LHS_count = 0;
int rule5_miss_records = 0;
int rule6_LHS_count = 0;
int rule6_miss_records = 0;
int rule7_LHS_count = 0;
int rule7_miss_records = 0;
int rule8_LHS_count = 0;
int rule8_miss_records = 0;
int rule9_LHS_count = 0;
int rule9_miss_records = 0;
int rule10_LHS_count = 0;
int rule10_miss_records = 0;
int rule11_LHS_count = 0;
int rule11_miss_records = 0;
int rule12_LHS_count = 0;
int rule12_miss_records = 0;

char rule_name1[] = "Rule 1";
char rule_name2[] = "Rule 2";
char rule_name3[] = "Rule 3";
char rule_name4[] = "Rule 4";
char rule_name5[] = "Rule 5";
char rule_name6[] = "Rule 6";
char rule_name7[] = "Rule 7";
char rule_name8[] = "Rule 8";
char rule_name9[] = "Rule 9";
char rule_name10[] = "Rule 10";
char rule_name11[] = "Rule 11";
char rule_name12[] = "Rule 12";

int records_read = 0;

void output(char [], int, int); // function prototype

int main()
{
    ifstream inClientFile("gpzoo.tab", ios::in);
    if (!inClientFile) {
        cerr << "File could not be opened " << endl;
        exit(1);
    }


char  name[10]; 
int  hair; 
int  feathers; 
int  eggs; 
int  milk; 
int  airborne; 
int  aquatic; 
int  predator; 
int  toothed; 
int  backbone; 
int  breathes; 
int  venomous; 
int  fins; 
int  legs; 
int  tail; 
int  domestic; 
int  catsize; 
int  classification; 
while (inClientFile >> name >> hair >> feathers >> eggs
>> milk >> airborne >> aquatic >> predator >> toothed >> backbone >>
breathes >> venomous >> fins >> legs >> tail >> domestic >> catsize >>
classification)

{ 

records_read  =  records_read  +1; 


//test  rule  1 

if  (legs  >=  3.6  &&  legs  <=  5.9) 

{ rule1_LHS_count = rule1_LHS_count + 1;}


if ( (legs >= 3.6 && legs <= 5.9) && (classification != 1))
{ rule1_miss_records = rule1_miss_records + 1;}


//  test  rule  2 
if  ( hair  ==  1) 

{  rule2_LHS_count  =  rule2_LHS_count  +1;} 

if  ( hair  ==  1  &&  classification  !=  1 ) 

{  rule2_miss_records  =  rule2_miss_records  +1;} 


//  test  rule  3 
if  (  fins  ==  1) 

{  rule3_LHS_count  =  rule3_LHS_count  +1;} 


if  (  fins  ==  1  &&  classification  !=  4) 

{  rule3_miss_records  =  rule3_miss_records  +1;} 

//  test  rule  4 
if ( catsize == 1)

{  rule4_LHS_count  =  rule4_LHS_count  +1;} 

if  ( catsize  ==  1  &&  classification  !=  1) 

{ rule4_miss_records = rule4_miss_records + 1;}

//  test  rule  5 

if  ( legs  >=  4.3  &&  legs  <=  7.0) 

{  rule5_LHS_count  =  rule5_LHS_count  +1;} 

if  ( (legs  >=  4.3  &&  legs  <=  7.0)  &&  (classification  !=  6)) 
{rule5_miss_records  =  rule5_miss_records  +1;} 


//  test  rule  6 

if  ( legs  >=  1.2  &&  legs  <=  3.9) 

{  rule6_LHS_count  =  rule6_LHS_count  +1;} 

if (( legs >= 1.2 && legs <= 3.9) && (classification != 2))
{  rule6_miss_records  =  rule6_miss_records  +1;} 


//  test  rule  7 
if  ( catsize  ==  0) 

{  rule7_LHS_count  =  rule7_LHS_count  +  1 ; } 

if  ( ( catsize  ==  0  )  &&  (classification  !=  1) ) 

{ rule7_miss_records = rule7_miss_records + 1;}


//  test  rule  8 


if  ( eggs  ==  0) 

{  rule8_LHS_count  =  rule8_LHS_count  +  1;} 
if ( ( eggs == 0 ) && (classification != 1))

{ rule8_miss_records = rule8_miss_records + 1;}


} 

cout << "Number of records read in: " << records_read << endl;

output( rule_name1, rule1_LHS_count, rule1_miss_records);
output( rule_name2, rule2_LHS_count, rule2_miss_records);
output( rule_name3, rule3_LHS_count, rule3_miss_records);
output( rule_name4, rule4_LHS_count, rule4_miss_records);
output( rule_name5, rule5_LHS_count, rule5_miss_records);
output( rule_name6, rule6_LHS_count, rule6_miss_records);
output( rule_name7, rule7_LHS_count, rule7_miss_records);
output( rule_name8, rule8_LHS_count, rule8_miss_records);

return  0;  } 

////////////////////////////////////////////////////


void output( char rule_id[], int LHS_matches, int misclassed_records)


{ float rule_conf = 1.0 - ((float) misclassed_records/LHS_matches);

cout << setiosflags(ios::left) << rule_id << " " << setw(5) << LHS_matches
<< " " << setw(5) << misclassed_records << " " << setprecision(3) <<
rule_conf << endl; }
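
Every rule in the listing follows the same pattern: count the records that match the rule's left-hand side, count how many of those carry a class other than the predicted one, and report confidence as 1 - (misclassified / LHS matches). The sketch below shows one way that pattern might be written once and reused. It is not part of the thesis program. It assumes standard C++ rather than the pre-standard headers used above, and the names `Animal`, `Rule`, `tally`, `tally_all`, and `confidence` are hypothetical. Returning 0 when no record matches the left-hand side is also an assumption; the original `output` function does not guard against that case.

```cpp
#include <vector>

// Hypothetical record type: only the attributes used by the example rules.
struct Animal {
    int hair;
    int fins;
    int legs;
    int classification;
};

// Hypothetical representation of a rule "if LHS then class == predicted_class".
struct Rule {
    const char* name;
    bool (*lhs)(const Animal&);  // left-hand-side condition
    int predicted_class;         // class asserted by the right-hand side
    int lhs_count;               // records matching the LHS
    int miss_records;            // LHS matches whose class differs
};

// Example LHS conditions, mirroring rules 2 and 3 of the listing.
bool has_hair(const Animal& a) { return a.hair == 1; }
bool has_fins(const Animal& a) { return a.fins == 1; }

// Update one rule's counters for a single record.
void tally(Rule& r, const Animal& a) {
    if (r.lhs(a)) {
        ++r.lhs_count;
        if (a.classification != r.predicted_class) {
            ++r.miss_records;
        }
    }
}

// Apply every rule to every record.
void tally_all(std::vector<Rule>& rules, const std::vector<Animal>& data) {
    for (const Animal& a : data) {
        for (Rule& r : rules) {
            tally(r, a);
        }
    }
}

// Confidence = 1 - misclassified / LHS matches.
// Assumption: report 0 when the LHS matched no records.
float confidence(const Rule& r) {
    if (r.lhs_count == 0) {
        return 0.0f;
    }
    return 1.0f - static_cast<float>(r.miss_records) / r.lhs_count;
}
```

With this layout, adding a rule means adding one condition function and one `Rule` entry, instead of a new pair of global counters and two `if` statements.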